There are two ways to measure complexity, and this article argues that LLMs only handle one of them.
Shannon entropy
So the first way to measure complexity is Shannon entropy: basically, how surprised are you by the next piece of data? The example the article gives is the digits of π, where every digit is a surprise. By this measure, π is maximally complex.
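To make "surprise" concrete, here's a minimal Python sketch (my own illustration, not code from the article) that computes the empirical Shannon entropy of a string in bits per symbol:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Empirical Shannon entropy of s, in bits per symbol."""
    counts = Counter(s)
    n = len(s)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("aaaaaaaaaa"))  # 0.0 bits: no surprise at all
print(shannon_entropy("0123456789"))  # ~3.32 bits: every digit equally surprising
```

A uniformly random digit stream maxes out at log2(10) ≈ 3.32 bits per digit, and statistically the digits of π look exactly like that.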
This is what LLMs optimize for. When a model trains on text, it's learning to minimize cross-entropy loss.
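Minimizing cross-entropy just means maximizing the probability the model assigns to whatever token actually came next. A toy version of the loss (my own naming, not any particular framework's API):

```python
import math

def cross_entropy_loss(pred_dists, targets):
    """Average negative log-probability the model assigned to the observed tokens."""
    return -sum(math.log(dist[t]) for dist, t in zip(pred_dists, targets)) / len(targets)

# Toy two-token vocabulary {0, 1}; the true continuation is [1, 0].
confident = [{0: 0.05, 1: 0.95}, {0: 0.9, 1: 0.1}]
uniform = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]

print(cross_entropy_loss(confident, [1, 0]))  # low loss: little surprise
print(cross_entropy_loss(uniform, [1, 0]))    # ln 2, about 0.693: maximal surprise on 2 tokens
```

Lower loss is literally lower surprise, which is why this objective is Shannon's measure in disguise.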
Kolmogorov complexity
But here's the thing: π is also one of the simplest objects in math. You can generate every single one of its digits with a short program.
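For example, Gibbons' unbounded spigot algorithm streams the digits of π forever using a few lines of integer arithmetic (a Python sketch of the idea, not necessarily the article's own snippet):

```python
from itertools import islice

def pi_digits():
    """Stream the decimal digits of pi one at a time (Gibbons' unbounded spigot)."""
    q, r, t, j = 1, 180, 60, 2
    while True:
        u = 3 * (3 * j + 1) * (3 * j + 2)
        y = (q * (27 * j - 12) + 5 * r) // (5 * t)  # next digit
        yield y
        q, r, t, j = 10 * q * j * (2 * j - 1), 10 * u * (q * (5 * j - 2) + r - y * t), t * u, j + 1

print(list(islice(pi_digits(), 10)))  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```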
Kolmogorov complexity measures this: how short is the program that produces the output? For π, it's quite short. So you have this situation where the same object is maximally complex by one measure and trivially simple by another.
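Kolmogorov complexity itself is uncomputable, but a general-purpose compressor gives a crude, computable upper bound, and it illustrates the gap: a compressor, like an LLM, only sees statistical structure. A sketch (my own illustration):

```python
import random
import zlib

def k_upper_bound(data: bytes) -> int:
    # Compressed size is a computable upper bound on Kolmogorov complexity.
    return len(zlib.compress(data, 9))

structured = b"0123456789" * 1000  # produced by a tiny program, statistically repetitive
random.seed(0)
noise = bytes(random.randrange(256) for _ in range(10000))  # statistically patternless

print(k_upper_bound(structured))  # tiny: the repetition is statistical, so zlib finds it
print(k_upper_bound(noise))       # ~10000: nothing statistical to exploit
```

The digits of π would land near the "noise" end for zlib, even though a ten-line program generates them: statistical tools can't see the short program.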
This is the key insight of the whole article. LLMs are Shannon machines: they learn the statistical patterns of outputs. They don't learn the underlying program that generates those outputs.
The Einstein example
The article uses Einstein to make this concrete. In 1887, the Michelson-Morley experiment showed the speed of light was the same in every direction, which broke the physics of the time. Many physicists responded by "adding parameters" to make the math fit without questioning the model itself. This is basically overfitting.
Einstein then asked "what kind of universe would make this result obvious?" and from that one question he got special relativity, then eventually general relativity. That's Kolmogorov complexity in action. An LLM trained on all the astronomy data before 1905 could predict orbits well, but it would never ask "what if the whole framework is wrong?" (maybe it will eventually though)
Judea Pearl's ladder of causation
Next, the article brings in Judea Pearl's ladder of causation.
Pearl says there are three levels of reasoning:
- association: what patterns exist in the data? This is where LLMs live.
- intervention: what happens if I change something? This needs a causal model, not just correlations.
- counterfactual: what would have happened if things were different?
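The gap between levels 1 and 2 is easy to simulate. In this sketch (my own toy example, not Pearl's), a hidden confounder Z drives both X and Y, so X and Y are strongly correlated in observational data, but intervening on X (Pearl's do-operator) reveals that X has no effect on Y at all:

```python
import random

random.seed(0)
N = 100_000

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    sx = (sum((x - mx) ** 2 for x in xs) / len(xs)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / len(ys)) ** 0.5
    return cov / (sx * sy)

# Level 1 (association): Z -> X and Z -> Y, but no arrow X -> Y.
obs_x, obs_y = [], []
for _ in range(N):
    z = random.gauss(0, 1)
    obs_x.append(z + random.gauss(0, 0.1))
    obs_y.append(z + random.gauss(0, 0.1))
print(correlation(obs_x, obs_y))  # ~0.99: X and Y look tightly linked

# Level 2 (intervention): do(X = x) cuts the Z -> X arrow.
do_x, do_y = [], []
for _ in range(N):
    z = random.gauss(0, 1)
    do_x.append(random.gauss(0, 1))       # X set from outside, ignoring Z
    do_y.append(z + random.gauss(0, 0.1))
print(correlation(do_x, do_y))  # ~0.0: changing X does nothing to Y
```

No amount of extra observational data fixes this; distinguishing the two requires knowing (or assuming) the causal graph, which is exactly the point of the ladder.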
Einstein was at level 3. "What would the universe have to be like for light speed to be constant?" is a counterfactual question. And Pearl argues that you can't get to a higher level just by collecting more data at a lower one.
So when people say "just make the model bigger and it'll get there", I think this framework suggests that's not how it works. It's not a scale problem; it's a structural one.
Don Knuth's story
Don Knuth recently wrote about Claude Opus 4.6 solving an open problem he'd been working on for weeks: decomposing the arcs of a certain 3D digraph into three Hamiltonian cycles for every odd case. I think this story is the perfect example of both the power and the limit of LLMs right now.
His friend Filip Stappers posed the problem and told Claude to write its progress to a plan.md file after every exploration run before doing anything else.
The interesting part is Claude's approach across 31 exploration runs. It started with brute-force DFS (too slow), then discovered a "serpentine pattern" in 2D, then tried fiber decompositions, then simulated annealing, hit dead ends, tried coordinate rotations that almost worked (only 3(m-1) conflicts across all the vertices), and eventually circled back to study a previous simulated-annealing solution more carefully. At exploration 31 it noticed that the choice at each fiber depends on only a single coordinate, and it produced a Python program that gave valid decompositions. Stappers tested it on progressively larger cases and it worked every time.
Knuth called the strategy "quite admirable" and said Claude had "deduced where to look."
Then three things happened that I think matter:
- when Stappers pushed Claude to the even case, Claude got stuck and eventually couldn't even write correct exploration programs anymore. The session just degraded.
- after Claude found the construction, Knuth had to sit down and prove why it works. Claude found the pattern, Knuth found the mechanism.
- later, GPT-5.4 Pro produced a 14-page proof for the even case. And another researcher found a simpler construction for the odd case by bouncing ideas between GPT and Claude. The LLMs are very powerful tools, but they need humans directing them and other LLMs complementing them.
Conclusion
I think for 99% of practical tasks, LLMs are REALLY good at pattern-matching. Shannon-level reasoning handles all of that.
But I think the article is right that there's a potential ceiling, and it's not one you can break through with more parameters or more data. The question "what kind of universe would produce this data?" is fundamentally different from "what data comes next?".
Breaking through it will probably take completely new architectures. I genuinely don't know.