Why Yann LeCun Thinks LLMs Are Fundamentally Wrong (And What He Wants Instead)
Yann LeCun — Turing Award winner, inventor of CNNs, Chief AI Scientist at Meta — argues that LLMs are fundamentally incapable of genuine intelligence. He's not a doomer or a cheerleader. He's someone with the technical depth to say uncomfortable things. Here's what he actually believes, what he's building instead, and why it matters for how we think about knowledge and learning.
Every few months, someone declares that large language models have crossed a threshold. They can pass the bar exam. They can write code. They can reason through complex problems. And every few months, Yann LeCun looks at the evidence and says: you're missing something.
LeCun is not a doomer. He doesn't think AI will destroy humanity. He is also not a cheerleader. What he is, is one of the few researchers in the world with the technical depth and institutional positioning to say uncomfortable things out loud — and mean them.
Who Is Yann LeCun?
Yann LeCun shared the 2018 Turing Award with Geoffrey Hinton and Yoshua Bengio — the trio often called the godfathers of deep learning. He invented convolutional neural networks, the architecture behind every major image recognition system of the last decade. He is Chief AI Scientist at Meta. He has been in this field longer than most of its current practitioners have been alive.
So when he says that large language models cannot lead to genuine intelligence, it is worth understanding exactly what he means.
The Core Argument
LeCun's critique of LLMs is specific and structural. The dominant paradigm — autoregressive next-token prediction — trains models to predict the next piece of text given everything before it. Do this at massive scale, across most of the internet, and you get systems that produce fluent, coherent, often impressive text.
But LeCun argues this process trains a fundamentally different thing than what we call understanding. It trains a model of text, not a model of the world.
The distinction matters. Text is a lossy, abstract representation of reality. It encodes what humans chose to write down, which is not the same as how reality works. An LLM learns the statistical regularities of that representation — what words tend to follow other words, what arguments tend to follow what premises, what answers tend to follow what questions. It does this extraordinarily well.
What it does not do — cannot do, LeCun argues — is model the underlying structure of reality that the text is about.
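To make the distinction concrete, here is a deliberately tiny sketch of autoregressive next-token prediction, shrunk down to a bigram model in plain Python. The corpus and names here are invented for illustration; real LLMs are transformers conditioning on long contexts, but the training objective has the same shape: predict the next token from the tokens before it.

```python
from collections import Counter, defaultdict

# A toy "language model": count which token follows which.
corpus = "the glass falls off the table the glass breaks on the floor".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the statistically most likely next token."""
    counts = following[token]
    return counts.most_common(1)[0][0] if counts else None

# The model has learned regularities of the *text* -- "glass" is often
# followed by "falls" or "breaks" -- without any model of glasses, tables,
# gravity, or why those continuations appear in the corpus.
next_word = predict_next("the")
```

Scaling this up by many orders of magnitude changes what the statistics can capture, but on LeCun's account it does not change what is being modeled: the distribution of text, not the world the text describes.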
He points to several recurring failure modes:
- Physical reasoning: Ask an LLM what happens when you push a glass to the edge of a table, and it will frequently hedge, confabulate, or simply be wrong.
- Hallucinations: They confidently make things up because they're optimizing for plausible text, not for truth. The training objective never required truth, only coherence.
- Planning limits: They cannot reliably perform multi-step planning that requires tracking the real consequences of actions in sequence.
- Static knowledge: Their knowledge is frozen at training time — they do not update from experience the way a living system does.
From LeCun’s perspective, these are not just engineering bugs. They are symptoms of a deeper architectural mismatch between next-token prediction and genuine understanding.
What He Proposes Instead: JEPA and World Models
LeCun’s alternative vision centers on two ideas:
- World models — internal models of how the world works that support prediction, planning, and counterfactual reasoning.
- Joint Embedding Predictive Architecture (JEPA) — a way to learn those models without predicting raw tokens or pixels.
Why Not Predict Tokens or Pixels?
Predicting in token space (for language) or pixel space (for images and video) is high-dimensional and noisy. There are infinitely many ways a video frame could look; infinitely many ways a sentence could continue. Training a model to predict in that raw space forces it to model enormous amounts of irrelevant variation.
JEPA takes a different route:
- Instead of predicting raw tokens or pixels, the model predicts in abstract embedding space — a compressed representation of the world's underlying structure.
- The model learns what is invariant and meaningful across different surface representations, rather than memorizing surface statistics.
- The result is a system that aims to encode the structure of reality, not just the statistics of how humans write about it.
In a JEPA-style setup, the system learns to map different views or segments of the same underlying situation into a shared representation space and to predict one representation from another. The hope is that this forces the model to capture the latent causal and physical structure that generates the observations.
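The shape of that training signal can be sketched in a few lines. Everything below is a hypothetical toy of my own construction, not code from any released JEPA system: two views of the same underlying state are encoded separately, a predictor maps one embedding toward the other, and the loss lives entirely in embedding space rather than in raw input space.

```python
def encode(view, weights):
    """Tiny linear 'encoder': project a raw view into a small embedding."""
    return [sum(w * v for w, v in zip(row, view)) for row in weights]

def embedding_loss(pred, target):
    """Squared distance in embedding space -- the JEPA-style objective."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

# Two noisy views of the same underlying state.
state = [1.0, 2.0, 3.0]
view_x = [s + 0.1 for s in state]
view_y = [s - 0.1 for s in state]

enc_w = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.5]]   # shared encoder (made-up weights)
pred_w = [[1.0, 0.0], [0.0, 1.0]]            # predictor (identity, for brevity)

z_x = encode(view_x, enc_w)
z_y = encode(view_y, enc_w)
z_pred = encode(z_x, pred_w)                 # predict y's embedding from x's

loss = embedding_loss(z_pred, z_y)
# Training would adjust enc_w and pred_w to shrink this loss, pushing the
# embeddings to keep what the views share (the state) and discard
# view-specific noise -- the irrelevant variation that raw-pixel or
# raw-token prediction would be forced to model.
```

The point of the sketch is where the loss is computed: never against raw pixels or tokens, only against another learned representation.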
Grounding: Why Text Alone Isn’t Enough
The other pillar of LeCun’s critique is grounding. He argues that genuine understanding requires sensory and physical experience, not just exposure to text.
A language model learns the word "heavy" by processing sentences that use the word "heavy." A child learns "heavy" by picking up objects, feeling resistance, watching things fall. These are fundamentally different kinds of learning.
This is why video occupies a central place in LeCun's vision:
- A baby learns physics not by reading a physics textbook but by pushing things, dropping things, watching objects move and collide.
- The statistics of physical reality are embedded in video in a way that text cannot capture — the weight of objects, the behavior of liquids, the arc of thrown things.
- Grounding AI in video and sensory data, rather than text alone, is what LeCun believes is required for any system that will genuinely understand the world.
World models are the integration point: internal simulations of how the world works that allow an agent to plan, predict, and reason about counterfactuals. Not "what text comes next" but "what happens if I do this."
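That "what happens if I do this" loop can be made concrete with a toy example. The dynamics and names below are invented for illustration, not drawn from any real system: the agent consults an internal transition model to predict the consequence of each candidate action, then picks the action whose predicted outcome it prefers.

```python
def world_model(state, action):
    """Predicted next state: a glass at `position` on a table of width 5."""
    position, broken = state
    if action == "push":
        position += 1
    if position > 5:          # predicted physics: past the edge, it falls
        broken = True
    return (position, broken)

def score(state):
    """The planner's preference: an intact glass beats a broken one."""
    _, broken = state
    return 0 if broken else 1

def plan(state, actions=("push", "wait")):
    """Choose the action whose *predicted* consequence scores highest."""
    return max(actions, key=lambda a: score(world_model(state, a)))

# With the glass at the edge, the model predicts that pushing breaks it,
# so the planner chooses to wait -- a decision grounded in simulated
# consequences, not in what sentence usually comes next.
choice = plan((5, False))
```

Here the world model is hand-written; LeCun's program is about learning it from sensory data. But the planning loop is the same: simulate, evaluate, act.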
The Fair Counterargument
LeCun's critics are numerous and credible. The emergent reasoning capabilities of frontier LLMs have surprised researchers who argued such capabilities were impossible at any scale. Chain-of-thought prompting, extended reasoning chains, and tool use suggest that something more structured than pure pattern matching may be happening inside these systems.
The honest position is that we don't fully know. Interpretability research isn't there yet. Whether LLMs are doing something that genuinely deserves to be called reasoning — or something that merely resembles it from the outside — is an open empirical question.
But LeCun's directional bet is interesting regardless of who turns out to be right:
- If he's wrong, the emergent reasoning is real and we'll understand why in a few years.
- If he's right, the current scaling trajectory hits a ceiling — and the path forward requires something closer to what he is describing.