Sequential Structure All the Way Down
On formal grammars, cognitive evolution, and why a sequence benchmark might be the lens computational neuroscience needs
Introduction: Lashley’s Problem Is Everyone’s Problem
In 1951, Karl Lashley delivered a paper that should have reoriented neuroscience. “The Problem of Serial Order in Behavior” argued that the coordination of sequences — of vocal movements, of words in a sentence, of sentences in discourse, of muscular contractions in reaching — requires hierarchical plans organized by the brain. Not associative chains. Not stimulus-response pairs. Internal structure.
Lashley saw what behaviorism refused to acknowledge: that serial order is a problem that cuts across every domain of cognition. Not just language. Speech production. Motor planning. Goal hierarchies. Musical performance. Even something as mundane as making coffee requires nested sub-goals executed in the right order. As Gregory Hickok puts it in Wired for Words, Lashley’s insight was that the problems raised by the organization of language are characteristic of almost all other cerebral activity.
Seven decades later, we are still grappling with Lashley’s problem, and our models of cortical function often struggle to accommodate it.
Here’s the thread. For several months, as part of a longer, ongoing research effort, we have been developing SymSeqBench, a framework for generating and analyzing symbolic sequences with formally controlled complexity. The goal is to give computational neuroscience the tools it needs to probe sequence learning systematically: across species, across implementations, across domains. What started as a benchmarking tool (and a meta-review, to come later) has increasingly become a lens for seeing something bigger: that the formal structure of sequential computation might be the best metric we have for understanding cognitive complexity and its evolution.
Let me walk you through how I got there and where this is leading me.
A Costly Fragmentation
A psycholinguist studying artificial grammar learning, a computational neuroscientist modeling temporal credit assignment, a neuromorphic engineer benchmarking spiking networks, and an LLM developer are all studying the same underlying problem: sequential structure. How systems, biological or artificial, learn to process, predict, and produce ordered sequences.
But you’d never know it from their methods. Every subfield uses its own task generation pipeline, its own complexity metrics, its own idiosyncratic stimuli. The psycholinguist can’t use the neuromorphic benchmark. The computational modeler can’t compare results to behavioral data. “Non-adjacent dependency task” means different things to different researchers, and the same task even goes by different names. We’re doing parallel play, not science.
This is why we built SymSeqBench. The framework combines two components: SymSeq (for generating and analyzing rule-based symbolic sequences using formal language theory) and SeqBench (for transforming those sequences into embedded datasets with controllable complexity). The critical design choice was grounding everything in formal foundations — the Chomsky hierarchy, specifically — rather than ad-hoc task design.
Why does this matter? Because formal language theory gives us a taxonomy of computational complexity that isn’t arbitrary. We know, provably, what separates regular from context-free languages. We know what memory architectures each level requires. Starting from theory means you’re not guessing about what makes your task difficult; you’re manipulating known structural properties.
The framework uses topological entropy as its primary complexity measure, computed via spectral analysis of grammar transition structures. This gives you a continuous metric that correlates with learning difficulty, letting you tune task complexity smoothly rather than jumping between qualitatively different tasks. And it operates at four analysis scales — token, string, string-set, grammar — because sequence learning is hierarchically organized. A model that nails local transition statistics but fails at long-range dependencies has learned something very different from one that captures grammar-level structure, and you need tools that can tell the difference.
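To make the entropy measure concrete, here is a minimal sketch of the textbook definition, not SymSeqBench’s actual implementation (which is not shown here). It assumes the grammar is given as a deterministic finite-state transition graph whose adjacency matrix counts the labeled transitions between states; under that assumption the topological entropy is the logarithm of the matrix’s spectral radius. The function name and the three example grammars are illustrative.

```python
import math
import numpy as np

def topological_entropy(adjacency) -> float:
    """Log of the spectral radius of a grammar's transition (adjacency) matrix.

    Entry [i][j] is the number of distinct symbols that move the grammar
    from state i to state j. Assumes a deterministic presentation.
    """
    A = np.asarray(adjacency, dtype=float)
    spectral_radius = float(np.max(np.abs(np.linalg.eigvals(A))))
    if spectral_radius <= 0.0:
        return float("-inf")  # no arbitrarily long strings can be generated
    return math.log(spectral_radius)

# Hypothetical example grammars over two states.
unconstrained = [[1, 1], [1, 1]]       # every transition allowed
no_repeat = [[1, 1], [1, 0]]           # state 1 may not follow itself
strict_alternation = [[0, 1], [1, 0]]  # a single deterministic cycle

h_full = topological_entropy(unconstrained)
h_golden = topological_entropy(no_repeat)
h_cycle = topological_entropy(strict_alternation)
```

The unconstrained two-state grammar gives log 2, forbidding a single transition lowers the value to the log of the golden ratio, and a fully deterministic cycle gives zero. That is the kind of continuous tuning the metric allows.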
The bridge between formal theory and empirical testing is exactly what’s been missing.
The Chomsky Hierarchy Is Not About Language
Here’s the conceptual move that reframed my thinking. Most neuroscientists encounter the Chomsky hierarchy in a linguistics context and promptly forget it (or never encounter it at all). That’s a mistake. The hierarchy isn’t really about language. It’s about what classes of computation require what kinds of memory architecture.
Four levels. At the bottom, regular grammars, processable by finite-state automata with no external memory. One step up, context-free grammars, requiring a pushdown automaton: a finite-state machine augmented with a stack. Above that, context-sensitive grammars, needing a linear-bounded automaton: a Turing machine whose tape is limited to a length proportional to the input. At the top, recursively enumerable languages, requiring a full Turing machine.
The critical boundary, the one that matters for neuroscience, is between regular and context-free. This is where computation goes from “I can track what state I’m in” to “I can remember where I’ve been and return there in order.” It’s the boundary between correlation and structure. Between statistics and rules.
Consider center-embedded sentences: “The rat the cat the dog chased killed ate the malt.” A finite-state machine, regardless of how many states you give it, cannot enforce at arbitrary nesting depth the structural constraint that the number of subjects matches the number of verbs. A high-order Markov model with V^k states (where V is vocabulary size and k is dependency length) can approximate this statistically for short sequences, but it cannot enforce it in general. This is provable (it follows from the pumping lemma for regular languages), not empirical.
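A toy sketch makes the contrast tangible. It abstracts center embedding as strings of n nouns followed by n verbs (N^n V^n), compares a stack-based recognizer with a fixed-window model that only memorizes the k-symbol windows seen in training, and trains the latter on depths 1 to 5. The symbols, the window size of 6, and all function names are assumptions chosen for illustration; this is not code from SymSeqBench.

```python
def generate(n: int) -> str:
    """Return the center-embedded pattern N^n V^n (n nouns, then n verbs)."""
    return "N" * n + "V" * n

def stack_accepts(s: str) -> bool:
    """Pushdown-style check: push on each noun, pop on each verb."""
    stack = []
    seen_verb = False
    for sym in s:
        if sym == "N":
            if seen_verb:
                return False  # a noun after a verb breaks the nesting
            stack.append(sym)
        elif sym == "V":
            seen_verb = True
            if not stack:
                return False  # more verbs than nouns
            stack.pop()
        else:
            return False
    return seen_verb and not stack  # every noun must have been matched

def windows(s: str, k: int) -> set:
    """All length-k windows of s, padded with start (^) and end ($) markers."""
    padded = "^" * (k - 1) + s + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def train_window_model(strings, k: int) -> set:
    """Memorize every window that occurs in the training strings."""
    allowed = set()
    for s in strings:
        allowed |= windows(s, k)
    return allowed

def window_accepts(s: str, allowed: set, k: int) -> bool:
    """Accept a string only if all of its windows were seen in training."""
    return windows(s, k) <= allowed

K = 6  # window size (hypothetical choice)
allowed = train_window_model([generate(n) for n in range(1, 6)], K)

shallow_ok = window_accepts(generate(2), allowed, K)        # grammatical, seen
shallow_bad = window_accepts("NNVVV", allowed, K)           # ungrammatical, short
deep_ok = window_accepts(generate(8), allowed, K)           # grammatical, unseen depth
long_mismatch = window_accepts("NNNNNVVV", allowed, K)      # 5 nouns, 3 verbs
```

For short strings the window model behaves like the grammar: it accepts N^2 V^2 and rejects NNVVV. Beyond its window it fails in both directions, rejecting the grammatical depth-8 string and accepting the mismatched NNNNNVVV, while the stack-based recognizer gets both right.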
And this creates what I’ve started calling the “Markov illusion” — the belief that sufficiently powerful statistical models can substitute for structural computation. It’s the same illusion that makes large language models seem to understand grammar. The illusion breaks precisely when you test generalization to unseen depths or novel compositions. With SymSeqBench, we’ve seen this failure across every architecture we’ve tested: even within regular grammars, systematic generalization fails. The systems learn position-specific patterns, not abstract structure.
If artificial neural networks can’t do this without custom augmentations (see Deletang et al., 2023), how does biological tissue?
Mapping Cognitive Evolution Through Grammar Complexity
This is where it gets genuinely exciting. A recent paper by Klein and Barron (2024) — “Comparing cognition across major transitions using the hierarchy of formal automata” — makes the argument for an idea I’ve also been trying to explore: that the Chomsky hierarchy can serve as a map for the evolution of cognitive complexity.
Their framework identifies major cognitive transitions and relates each one to the hierarchy of formal automata:
Nets to Centralization: Distributed nerve nets (cnidarians) to centralized nervous systems. No change in formal class — still reactive, still finite-state — but centralization enables faster, coordinated processing.
Centralization to Recurrence: The emergence of recurrent connections, enabling temporal integration and memory. This is the transition from purely feedforward to recurrent processing — from stimulus-bound to context-dependent.
Recurrence to Lamination: Layered cortical structures enabling hierarchical processing and increasingly abstract representations.
Lamination to Reflection: The emergence of metacognition, self-monitoring, recursive thought.
At each transition, the computational repertoire expands. The organism can handle more complex sequential structures — deeper dependencies, longer-range correlations, more hierarchical nesting.
What makes this testable? If you can characterize the grammar complexity a species can handle behaviorally, you can place it on this cognitive map. We’ve already started doing this — analyzing cross-species behavioral sequences through the multi-scale analysis pipeline. The preliminary results are striking: mouse grooming sequences show lower syntactic complexity than zebrafish or finch vocalizations. Seals and turtles cluster at intermediate levels. The ordering isn’t what you’d naively predict from “brain size” or phylogenetic distance, which suggests the formal complexity metric is capturing something the traditional metrics miss. Deeper conclusions, however, would warrant a more systematic investigation.
Critical Transition Thresholds
There’s a parallel line of work that converges on the same idea from a completely different direction. Assembly Theory, developed by Lee Cronin, Sara Walker, and colleagues (Sharma et al., 2023), measures the complexity of molecular objects by their “assembly index” — the minimum number of joining operations needed to construct the object from basic building blocks. Their key finding: an assembly index above 15 reliably distinguishes molecules produced by living systems from those formed abiotically. It’s a complexity threshold that marks the transition from chemistry to biology.
The formal connection to our story is this: assembly indices are related to the descriptional complexity of formal grammars. The assembly process — recursive combination of sub-assemblies — maps onto context-free grammar production rules. The critical threshold is a transition in the complexity class of the generative process.
Klein and Barron’s cognitive transitions and Cronin’s molecular transitions are, in a sense, the same phenomenon observed at different scales: discontinuities in the complexity of rule structures that a system can generate and process mark qualitative transitions in the system’s nature. In molecules, you get life. In nervous systems, you get cognition. And the formal framework for measuring both is the same.
A recent empirical test by Voudouris et al. (2025) added computational weight to this picture. They tested artificial neural networks on tasks spanning the Chomsky hierarchy and found that the critical architectural transition is from feedforward to recurrent processing — mirroring what Klein and Barron predicted for biological nervous systems. The performance gap between architectures widens sharply at higher levels of the hierarchy, confirming that architectural transitions correspond to genuine computational capability boundaries.
The Neural Stack: A Biological Answer
If the Chomsky hierarchy maps onto cognitive evolution, the obvious question becomes: what neural machinery implements each level? For regular grammars, recurrent networks suffice — finite-state dynamics with memory implicit in the network state. But context-free grammars require a stack. Where is the biological stack?
The most striking proposal I’ve encountered comes from Rodriguez and Granger (2016). They argue that the hippocampus functions as a biological pushdown stack. The physiological basis is sharp-wave ripples — high-frequency oscillations (150-250 Hz) during which the hippocampus replays compressed sequences of neural activity. Forward replay pushes items to memory. Reverse replay pops them, accessing the most recently stored items first, exactly the behavior a pushdown automaton requires.
The prefrontal cortex, in this picture, provides the control logic: when to push and when to pop. Working memory capacity limits (Cowan’s estimate of roughly 4 chunks) can be reinterpreted as a limit on stack depth, constrained by the bandwidth of the cortico-hippocampal channel and the number of gamma cycles that fit within a theta cycle.
Here’s the clincher: Rodriguez and Granger argue that the computational power of a species is determined by the ratio of cortical size to hippocampal size. A larger cortex can buffer more “calls” to the hippocampal stack before saturation. The human cognitive leap isn’t a novel language module — it’s a phase transition in this ratio, crossing from simple regular grammars to mildly context-sensitive languages. The anatomy supports it: PFC projects to the hippocampus via the nucleus reuniens, and theta-gamma coherence between PFC and hippocampus correlates with working memory performance.
The biological implementation is necessarily noisy — attractor dynamics where “push” moves the system into a basin of attraction and “pop” is triggered by an end-of-sequence signal. This means the system doesn’t perfectly implement a stack; it implements something that behaves like a stack for shallow nesting depths and short delays, but degrades gracefully beyond capacity. The three-level center-embedding limit in human sentence processing is a feature of this noisy implementation, not a bug. Maybe it’s actually better than a perfect stack for real-world cognition — natural language rarely requires deep recursion, and a system tuned for the typical case is more efficient than one designed for a generic, rarely encountered case.
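The graceful degradation described above can be illustrated with a deliberately simplified, noise-free stand-in: a stack with a fixed capacity that silently loses its oldest entries on overflow. This is not Rodriguez and Granger’s model. The class name, the capacity of 4 (borrowed from the Cowan figure mentioned earlier), and the scoring function are all assumptions made for illustration.

```python
from collections import deque

class BoundedStack:
    """Stack holding at most `capacity` items; the oldest are lost on overflow."""

    def __init__(self, capacity: int = 4):
        # A deque with maxlen discards items from the opposite end when full.
        self.items = deque(maxlen=capacity)

    def push(self, item) -> None:
        self.items.append(item)

    def pop(self):
        # Return None instead of raising when nothing is left to recall.
        return self.items.pop() if self.items else None

def resolved_dependencies(depth: int, capacity: int = 4) -> int:
    """Count how many of `depth` nested subjects are paired with the right verb."""
    stack = BoundedStack(capacity)
    for subject in range(depth):            # open `depth` nested clauses
        stack.push(subject)
    correct = 0
    for expected in reversed(range(depth)):  # close them innermost first
        if stack.pop() == expected:
            correct += 1
    return correct

shallow = resolved_dependencies(depth=3)  # within capacity
deep = resolved_dependencies(depth=6)     # exceeds capacity
```

Within capacity every dependency is resolved. Beyond it, the innermost clauses are still matched correctly and only the outermost subjects are lost, so performance falls off gradually instead of collapsing.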
Where This Converges
Several threads are pulling together here, and our SymSeqBench sits at the center of the braid.
The Chomsky hierarchy as a cognitive metric: Not a linguistic curiosity but a formal framework for measuring cognitive complexity across species, architectures, and evolutionary transitions. SymSeqBench generates the task hierarchies and metrics needed to test this (some extensions required).
The Markov illusion as a diagnostic tool: When a system appears to handle complex sequences but fails at generalization, it’s operating below the complexity level of the task. SymSeqBench’s multi-scale analysis can distinguish surface statistics from genuine structural learning — the difference between riding correlations and enforcing rules.
Cross-species behavioral mapping: If grammar complexity places species on a cognitive map, we need standardized tools to measure it. SymSeqBench’s behavioral sequence analysis pipeline provides exactly this, and the early cross-species results suggest it works.
The neural architecture question: At what point in the Chomsky hierarchy do simple recurrent networks fail and biophysical features — dendritic nonlinearities, multi-timescale dynamics, cortical-hippocampal loops — become necessary? If the regular-to-context-free boundary requires something like a hippocampal stack, that’s direct evidence connecting neural architecture to formal computational power.
The questions I want to pursue from here:
Can spiking networks with biologically realistic features naturally implement pushdown-like behavior where point-neuron models fail? Can we build computational models with varying “cortex-to-hippocampus” ratios and measure the grammar complexity they handle? Is the allometric scaling prediction of Rodriguez and Granger computationally testable?
And perhaps the sharpest question: do the discontinuities in the Chomsky hierarchy — the boundaries between computational classes — correspond to discontinuities in the neural manifold geometry of systems processing these sequences? If the geometry changes qualitatively at the regular-to-context-free boundary, that would mean one of the most debated questions in cognitive science — whether recursion requires specialized neural machinery — has a precise, measurable answer.
The thread connecting formal language theory, hippocampal replay, cognitive evolution, and benchmark design feels like it’s converging toward something testable. The experimental framework exists. The questions are precise. Now we need the experiments (and the funding).
If you are working on these questions or are interested in collaborating, do reach out. Our tools provide the first steps; now we need to put them to good use.
References & Links
Rodriguez, A., & Granger, R. (2016). The grammar of mammalian brain capacity. Theoretical Computer Science, 633, 100-111. DOI: 10.1016/j.tcs.2016.03.021
Klein, C., & Barron, A. B. (2024). Comparing cognition across major transitions using the hierarchy of formal automata. WIREs Cognitive Science, 15(4), e1680. DOI: 10.1002/wcs.1680
Barron, A. B., Halina, M., & Klein, C. (2023). Transitions in cognitive evolution. Proceedings of the Royal Society B, 290(2002), 20230671. DOI: 10.1098/rspb.2023.0671
Voudouris, K., Barron, A. B., Halina, M., Klein, C., & Patel, M. (2025). Exploring major transitions in the evolution of biological cognition with artificial neural networks. arXiv preprint, arXiv:2509.13968.
Sharma, A., Czegel, D., Lachmann, M., Kempes, C. P., Walker, S. I., & Cronin, L. (2023). Assembly theory explains and quantifies selection and evolution. Nature, 622(7982), 321-328. DOI: 10.1038/s41586-023-06600-9
Jager, G., & Rogers, J. (2012). Formal language theory: Refining the Chomsky hierarchy. Philosophical Transactions of the Royal Society B, 367(1598), 1956-1970. DOI: 10.1098/rstb.2012.0077
Fitch, W. T., & Friederici, A. D. (2012). Artificial grammar learning meets formal language theory: An overview. Philosophical Transactions of the Royal Society B, 367(1598), 1933-1955. DOI: 10.1098/rstb.2012.0103
Lashley, K. S. (1951). The problem of serial order in behavior. In L. A. Jeffress (Ed.), Cerebral mechanisms in behavior (pp. 112-136). Wiley.
Hickok, G. (2025). Wired for Words: The Neural Architecture of Language. MIT Press.
Deletang, G., Ruoss, A., Grau-Moya, J., Genewein, T., Wenliang, L. K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., & Ortega, P. A. (2023). Neural Networks and the Chomsky Hierarchy. ICLR 2023. arXiv: 2207.02098
Zajzon, B., Bouhadjar, Y., Fabre, M., Schmidt, F., Ostendorf, N., Neftci, E., Morrison, A., & Duarte, R. (2025). SymSeqBench: A unified framework for the generation and analysis of rule-based symbolic sequences and datasets. arXiv preprint, arXiv:2512.24977.
Levelt, W. J. M. (1974). Formal Grammars in Linguistics and Psycholinguistics (3 vols.). Mouton.




