From Text to Video: How AI Understanding Is Evolving
The assumption that text is the primary medium of intelligence is starting to crack. What changes when AI learns from video, audio, and experience — not just language?
Most AI development has been built on a quiet, rarely examined assumption: language is the primary medium of intelligence. Feed a model enough text, and it will develop something that looks like understanding. GPT-4, Claude, Gemini — these are all, at their core, text-trained systems. They are extraordinarily capable. They are also built on a foundation that has a ceiling.
That ceiling is starting to become visible.
The Text Assumption
Text is how humans communicate knowledge — but it is not how we acquire it. Yann LeCun has made this argument forcefully: we learn by experiencing the world, not by reading about it. Language is a lossy compression of lived, embodied experience. It is the output of understanding, not its source.
A child does not learn what “falling” means by reading the word. She learns by dropping things, by tumbling herself, by watching objects arc through the air and hit the floor. Text models inherit the distilled residue of millions of such experiences, abstracted into language — but they miss the raw signal underneath.
What Text Cannot Capture
The gaps become clear when you probe for specifics. Spatial reasoning — “the cup is to the left of the plate” — is trivial to say and genuinely hard to understand without spatial grounding; text models can repeat the pattern, but whether they have the underlying geometry is far less clear. Physical causality lives in video, not prose: the motion of a falling cup, the shatter, the spray of liquid encode physics that a written sentence can only point at. Procedural knowledge is the starkest case — a written description of how to tie a bowline knot is nearly useless compared to a thirty-second video, because the hands know something the words don’t. Emotional nuance disappears almost completely in text: tone, micro-expression, the slight pause before an answer require voice and face to survive.
These are not edge cases. They represent entire categories of knowledge that text models handle through pattern matching rather than genuine grounding.
What Video Models Actually Learn
The arrival of Sora, Gemini 1.5 Pro’s million-token video context, and GPT-4V marks more than an incremental capability upgrade. These systems are learning from a fundamentally richer signal.
When a video model trains on footage of a glass tipping off a table and shattering, it does not receive a text description of shattering. It receives pixel-level temporal data: how momentum transfers, how glass fractures along stress lines, how liquid spreads according to surface tension and gravity. The model learns the physics not from Newton’s laws written in prose, but from the thing itself, frame by frame.
This distinction matters enormously. Sora, OpenAI’s video generation model, demonstrates an implicit grasp of Newtonian mechanics that emerges not from textbook training but from the structure of video itself. Asked to generate footage of a bouncing ball, it produces appropriate squash-and-stretch deformation, correct arc trajectories, and plausible energy dissipation, behaviors that were never explicitly programmed but fell out of learning from real-world video data.
Gemini 1.5 Pro takes a different approach: rather than generating video, it reasons over it. Feed it an hour of a complex surgical procedure, and it can identify the instruments used, describe what each step accomplishes, flag moments where technique deviates from standard practice, and answer detailed follow-up questions about specific timestamps. This is temporal reasoning — understanding not just what is happening, but when, in what sequence, and why the sequence matters.
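As a rough illustration of what this looks like in practice, here is a minimal sketch using the google-generativeai Python SDK; the file name and prompt are placeholders, and exact method names may differ across SDK versions.

```python
# Minimal sketch: asking timestamped questions about a long recording with Gemini 1.5 Pro.
# Assumes the google-generativeai SDK and an API key in the environment;
# the file path and prompt are placeholders.
import os
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Upload the recording; long videos are processed asynchronously before they can be used.
video = genai.upload_file(path="procedure_recording.mp4")
while video.state.name == "PROCESSING":
    time.sleep(10)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")

# Temporal questions reference the video directly; answers can cite specific timestamps.
response = model.generate_content([
    video,
    "List each major step in this procedure with its start timestamp, "
    "and flag any step that deviates from the standard sequence.",
])
print(response.text)
```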
GPT-4V adds visual grounding to language reasoning: show it a circuit diagram and ask why it might fail; show it a screenshot of a software error and ask what’s causing it; show it a whiteboard architecture diagram and ask whether it would scale. These are tasks that pure text models cannot approach, because they require reading a visual artifact rather than processing language.
What these systems share is the ability to reason over time and space, not just over tokens. A text model reads left to right through a sequence of symbols. A video model reads through a sequence of frames, each frame a rich two-dimensional space, the whole sequence encoding causal relationships, object persistence, physical dynamics, and procedural structure. It is a fundamentally different kind of learning.
The Benchmarks That Reveal the Gap
The performance gap between text-only and video-capable models shows up most starkly on tasks that require grounded physical understanding.
On the PIQA (Physical Interaction: Question Answering) benchmark, which tests commonsense reasoning about physical interactions, text models plateau at around 80–83% accuracy. Models trained on or fine-tuned with video data show measurable improvements, particularly on questions involving object manipulation, trajectory prediction, and fluid behavior. The questions that trip up text models most reliably are exactly the ones that would be trivially obvious to anyone who has watched a video of the thing in question.
The Something-Something dataset, developed by TwentyBN, is perhaps the most revealing benchmark of all. It contains around 220,000 videos of humans performing basic physical interactions — pushing objects, picking things up, covering one thing with another. The task is to classify what is happening. Text models, given only textual descriptions of the scenes, perform at or near chance on fine-grained distinctions (“moving something towards the camera” vs. “moving something away from the camera,” “pushing something so it almost falls” vs. “pushing something so it falls off the edge”). Video models trained directly on the footage reach well above 70% accuracy on these same fine-grained distinctions.
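To make “fine-grained” concrete, here is a small self-contained sketch of accuracy restricted to confusable label pairs; the pair wording follows the examples above rather than the exact dataset class strings, and the predictions are invented for illustration.

```python
# Sketch: accuracy restricted to confusable label pairs, the distinctions that
# most clearly separate video-grounded models from text-only ones.
# Labels follow the examples above (not exact dataset class strings);
# the predictions are invented.
CONFUSABLE_PAIRS = [
    ("moving something towards the camera", "moving something away from the camera"),
    ("pushing something so it almost falls", "pushing something so it falls off the edge"),
]


def pair_accuracy(examples, pairs):
    """Accuracy on the subset of examples whose true label belongs to a confusable pair."""
    in_pairs = {label for pair in pairs for label in pair}
    scored = [(gold, pred) for gold, pred in examples if gold in in_pairs]
    if not scored:
        return 0.0
    return sum(gold == pred for gold, pred in scored) / len(scored)


# (gold label, predicted label): the hypothetical text model collapses the directional distinction.
examples = [
    ("moving something towards the camera", "moving something away from the camera"),
    ("moving something away from the camera", "moving something away from the camera"),
    ("pushing something so it falls off the edge", "pushing something so it almost falls"),
]
print(f"fine-grained pair accuracy: {pair_accuracy(examples, CONFUSABLE_PAIRS):.2f}")  # 0.33
```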
The gap is not small. It is not a matter of fine-tuning. It reflects a genuine difference in what the two types of model have learned.
On procedural Q&A — tasks where a model is asked to answer questions about how to do something based on instructional video — text models are severely limited when the knowledge does not exist in textual form. The CrossTask dataset measures whether a model can identify the correct step sequence for tasks like “make a latte” or “replace a car tire” from video. Text models score near chance when the video contains actions that are not verbally narrated. Video models, processing the visual stream directly, score dramatically higher because they can observe what is happening rather than parse what is being said about it.
Where video models still struggle: long-horizon reasoning over very long videos (more than 30 minutes), fine-grained object recognition in cluttered scenes, and causal reasoning that requires integrating information from non-contiguous timestamps. These are real limitations, and the field is actively working on them.
Why This Matters for Knowledge Work
The knowledge worker’s world is full of things that are extremely difficult to document in text.
Consider the experience of onboarding onto a complex codebase. Senior engineers do not typically write detailed prose descriptions of how the system works; they walk new hires through it, often by sharing their screen — showing which files to look at, demonstrating how to trigger a particular behavior, pointing at what matters and narrating in real time. The video of that session is worth more than the document that would take five times as long to write and would still fail to convey the relevant context.
The same pattern repeats across knowledge work: a designer walks a client through a prototype, narrating tradeoffs as they go; a DevOps engineer debugs a Kubernetes cluster misconfiguration by poking around live, talking through what they’re seeing; a data analyst explains a model’s output by walking through the notebook step by step. In all of these cases, the video of the session contains orders of magnitude more information than any written summary could capture.
Knowledge management systems have been built almost entirely around text — wikis, documentation, runbooks, internal blogs — partly because text was easier to index, search, and link, and partly because no tool existed for structuring the information locked inside video. The result is that the most important knowledge in most organizations is locked in people’s heads, or in recordings that no one watches because no one can find the right moment in them.
AI that can genuinely understand video changes this calculus completely.
From Video Understanding to Procedural Intelligence
The most consequential emerging capability is what might be called procedural AI: systems that can watch a recording of someone doing something and extract the underlying procedure — the steps, their sequence, the decision points, the variations.
This is already beginning to work. Researchers have demonstrated models that can watch a cooking video and produce a structured recipe with accurate ingredient quantities, step ordering, and timing. The same approach has been applied to software tutorials: a model watches a screen recording of someone configuring a tool, then generates a step-by-step guide that matches what was actually done rather than what a documentation writer thought should be done.
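What “structured” might mean in practice: a minimal sketch of an output schema for an extracted procedure. The field names and the sample data are illustrative, not drawn from any particular extraction system.

```python
# Sketch of one possible schema for a procedure extracted from video.
# Field names and sample values are illustrative; no specific system is implied.
from dataclasses import dataclass, field


@dataclass
class ProcedureStep:
    index: int                    # position in the observed sequence
    action: str                   # what the demonstrator actually did
    start_s: float                # timestamp back into the source recording
    end_s: float
    decision_point: bool = False  # a moment where the demonstrator chose between options
    observed_errors: list[str] = field(default_factory=list)  # failures seen across recordings


@dataclass
class ExtractedProcedure:
    title: str
    source_videos: list[str]      # a procedure can be distilled from many recordings
    steps: list[ProcedureStep]


latte = ExtractedProcedure(
    title="Make a latte",
    source_videos=["barista_demo_01.mp4"],
    steps=[
        ProcedureStep(1, "Grind the beans to a fine setting", 12.0, 34.0),
        ProcedureStep(2, "Pull the shot, aiming for roughly 25-30 seconds", 40.0, 71.0,
                      decision_point=True,
                      observed_errors=["shot pulled too fast from a coarse grind"]),
    ],
)
```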
The power here is not just efficiency. It is fidelity. Text documentation of procedures is notoriously brittle — it describes an idealized version of the process, written by someone who knows it well enough to skip the parts that feel obvious, and almost always fails at exactly the step where things go wrong for beginners. A model that learns from watching many people perform a procedure — including the failed attempts, the error messages, the moments of confusion — learns something much closer to the actual lived experience of doing the thing.
This creates the foundation for a new kind of expert system — not one built by having domain experts write down their knowledge, which is slow, incomplete, and always lagging, but one built by ingesting the recordings of experts doing their work and extracting the knowledge automatically. Every time a support engineer resolves a complex issue while sharing their screen, every time a senior developer reviews a pull request on a video call, every time a surgeon demonstrates a technique — the knowledge is there, in the video, waiting to be extracted.
The Documentation Revolution Coming
The 45-minute setup video is one of the great knowledge-transfer failures of the software industry. Someone who knows a system records themselves configuring it and uploads it. Almost immediately the video is out of date in at least three places. It cannot be searched, the five minutes of critical information are buried somewhere in the middle, and the viewer has to watch the entire thing just to find out whether what they need is even in there.
This is so common it has become normalized. It should not be.
AI video understanding closes this gap. A model that can parse a setup video can produce a timestamped summary, chapter markers, a searchable transcript accurate to what was actually demonstrated rather than what was said, a structured list of steps with prerequisites and dependencies, and answers to the questions a viewer is most likely to have.
This is not speculative — the component technologies exist today. What does not yet exist is their integration into the knowledge management tools that organizations actually use. That integration is coming, and it will change what documentation means.
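To make the shape of that output concrete, here is a minimal sketch of the last step, assuming an upstream model has already produced timestamped segments; the segment data is invented.

```python
# Sketch: turning timestamped segments (assumed to come from an upstream
# video-understanding model) into chapter markers and a step list.
# The segment data is invented.
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float
    title: str
    is_step: bool = False  # True when the segment is an actionable step rather than narration


def chapter_markers(segments: list[Segment]) -> str:
    """Render chapter markers in the familiar MM:SS form."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg.start_s), 60)
        lines.append(f"{minutes:02d}:{seconds:02d} {seg.title}")
    return "\n".join(lines)


def step_list(segments: list[Segment]) -> list[str]:
    """Pull out only the actionable steps, in order."""
    return [seg.title for seg in segments if seg.is_step]


segments = [
    Segment(0, "What this setup covers"),
    Segment(95, "Install the CLI and authenticate", is_step=True),
    Segment(312, "Configure the database connection", is_step=True),
]
print(chapter_markers(segments))
print(step_list(segments))
```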
The broader vision is a knowledge graph built from video: every recorded meeting, tutorial, demonstration, and onboarding session becomes a source of structured, searchable, linkable knowledge. A new hire can ask “how do we handle authentication in the API?” and receive a synthesized answer drawn from the ten most relevant recordings, with timestamps and the relevant procedure extracted as steps.
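A minimal sketch of the retrieval side of that vision, assuming each video segment already has an embedding from some multimodal encoder; the vectors and metadata below are random placeholders.

```python
# Sketch: answering "where was this shown?" across a corpus of recorded sessions.
# Assumes each segment already has an embedding from some multimodal encoder;
# here the embeddings and metadata are placeholders.
import numpy as np

rng = np.random.default_rng(0)

# (video_id, start_seconds, summary) for each indexed segment -- invented metadata.
segments = [
    ("onboarding_2024_03.mp4", 412.0, "walkthrough of the API auth middleware"),
    ("incident_review_17.mp4", 88.0, "debugging an expired signing key"),
    ("arch_overview.mp4", 1290.0, "service diagram and request flow"),
]
segment_vecs = rng.normal(size=(len(segments), 512))
segment_vecs /= np.linalg.norm(segment_vecs, axis=1, keepdims=True)


def top_k(query_vec: np.ndarray, k: int = 2):
    """Return the k most similar segments by cosine similarity, with timestamps."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = segment_vecs @ query_vec
    order = np.argsort(scores)[::-1][:k]
    return [(segments[i], float(scores[i])) for i in order]


# In a real system the query would be embedded by the same encoder as the segments.
query = rng.normal(size=512)
for (video, start, summary), score in top_k(query):
    print(f"{video} @ {start:.0f}s  ({score:.2f})  {summary}")
```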
Text documentation ages. Video documentation ages too — but AI-extracted procedural knowledge can be versioned, updated, and maintained in ways that raw video cannot.
What Multimodal Learning Paths Look Like
If the problem with text is that it is the wrong medium for certain kinds of knowledge, the solution is not to replace text with video. It is to match the medium to the concept.
The future of learning is not text courses or video courses. It is dynamically composed paths that use the right medium for each element of what needs to be learned.
Abstract concepts — what a function is, what a derivative measures, what a P-value means — are often best communicated through well-crafted text, because the structure of language maps well onto the structure of abstract thought. Spatial and physical concepts — how a cam shaft works, what a gradient descent landscape looks like, how light refracts through a lens — are best communicated visually, because a thirty-second animation beats three paragraphs of description when the concept lives in visual space. Procedural skills — how to use a tool, how to perform a technique, how to navigate a complex workflow — are best communicated through demonstration, because watching someone do the thing, with narration at the moments of decision, is the native medium of procedural knowledge transfer.
A learning platform built on this principle would not simply offer videos alongside text. It would understand the structure of what is being taught, identify which components are abstract, spatial, or procedural, and surface the right medium for each — adapting dynamically, skipping the conceptual walkthrough for learners who have already demonstrated mastery, offering the animated version when someone is struggling with spatial intuition.
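A toy sketch of that selection logic, reducing the taxonomy to the three categories above; every name and rule here is illustrative rather than a description of any shipping system.

```python
# Toy sketch of medium selection for a single learning element.
# The taxonomy and learner signals are deliberately reduced; all names and
# rules are illustrative, not a description of any shipping system.
from dataclasses import dataclass
from enum import Enum


class ConceptKind(Enum):
    ABSTRACT = "abstract"      # e.g. what a p-value means
    SPATIAL = "spatial"        # e.g. how light refracts through a lens
    PROCEDURAL = "procedural"  # e.g. how to navigate a complex workflow


@dataclass
class LearnerState:
    has_demonstrated_mastery: bool
    struggling_with_spatial_intuition: bool


DEFAULT_MEDIUM = {
    ConceptKind.ABSTRACT: "text explanation",
    ConceptKind.SPATIAL: "animation",
    ConceptKind.PROCEDURAL: "narrated screen demonstration",
}


def choose_medium(kind: ConceptKind, learner: LearnerState) -> str | None:
    """Pick a medium for one element, or skip it entirely."""
    if kind is ConceptKind.ABSTRACT and learner.has_demonstrated_mastery:
        return None  # skip the conceptual walkthrough for learners who have shown mastery
    if kind is ConceptKind.SPATIAL and learner.struggling_with_spatial_intuition:
        return "step-by-step animation with narration"  # slower, more explicit visual version
    return DEFAULT_MEDIUM[kind]
```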
This is the promise of multimodal AI in education: not just more content in more formats, but intelligent composition of the right content in the right format for the right learner at the right moment.
A New Foundation for Structured Learning
The thread running through all of this is structure. The reason text has dominated knowledge management is that text is already structured: it can be chunked, indexed, searched, linked, versioned, and recombined. Video, until recently, was a black box — you could store it, you could play it back, but you could not reason over its contents without watching it from beginning to end.
AI video understanding brings video into the structured knowledge ecosystem for the first time. It makes video contents searchable. It makes procedures extractable. It makes demonstrations referenceable. It makes the entire corpus of recorded human knowledge — which is vast and mostly locked in video — available as input to learning systems.
SILKLEARN is being built for exactly this world: one where knowledge arrives in every medium — text, video, audio, diagrams — but where the job of a learning platform is to impose structure, surface the right piece at the right moment, and compose paths that honor how people actually learn. The shift from text-only to multimodal AI understanding is not a feature addition. It is a change in what is possible.



