Positioning Is All You Need
Xin Li
arXiv:2404.01183v1 [q-bio.NC] 1 Apr 2024
Abstract
One can drive around safely using a GPS without memorizing a world map
(not to mention the dark regions that humans have never explored) because
we only pay attention to the next instruction of turning. Such simple observa-
for localized attention. This attention mechanism, at the system level, imple-
ments the selection/positioning operation needed by the hippocampus to pro-
vide spatio-temporal context to the neocortex. The localized receptive field of
plies that the attention mechanism is all you need to understand the flexible
behavior generated by hippocampal-neocortical systems. Geometrically, we
present a novel manifold positioning framework to explain the principle of lo-
calized embodied cognition. Through the co-evolution of the hippocampus and
the neocortex, the positioning operation implemented by the attention mecha-
Introduction
The sensory organs attached to mammalian brains can access only a localized region in space
at any specific time. Such spatio-temporal locality of sensory observation has a profound im-
plication for the mental construction of the internal world (a.k.a., internal representation (1))
including consciousness by the mammalian brain. Nature has discovered a clever solution to
learn a global representation of the physical world with a local receptive field of sensory cortex
by flexible behavior including spatial navigation. How do mammalian brains encode spatial
information about the physical world into populations of spike sequences in the temporal
domain? This question is a holy grail of neuroscience, and its importance to cognitive
science is comparable to that of relativity theory in modern physics.
To formulate the above problem geometrically, we note that the observation data acquired
by localized sensory organs can be abstracted by the projection of a manifold embedded in a
tion techniques (e.g., IsoMAP (2) and locally linear embedding (3)). An important new insight
brought by this work is that we do not need to reconstruct the global manifold but only need
to know the position within the global coordinates for goal-oriented behavior. This insight is
in alignment with intelligence without representation (1) and animate vision (4) which attempt
to put perception and action on an equal footing. Using GPS as the metaphor, we only need
access to a local map without the need to memorize the global map anywhere and anytime.
Such a locality principle, when combined with embodied cognition, is a fundamental advantage
that has been exploited by nature during the evolution of mammalian brains.
Through co-evolution of the hippocampus and the neocortex, nature solves the manifold
positioning problem by the simple attention mechanism (5), which can be mathematically mod-
eled by context (6) and computationally implemented by the transformer architecture (7). At
the level of individual neurons, attention or selection of firing patterns is implemented by the
nonlinear effects of inhibitory networks (8). At the level of cortical columns, canonical circuits
implementing predictive processing (9) are responsible for context-dependent modulation of
prediction errors. At the level of cortical areas, hippocampal indexing theory (10,11) appears to
be the most comprehensive treatment of modeling hippocampal-neocortical interactions (12).
By serving as the neocortex’s librarian (p. 285 of (8)), the hippocampus uses an indexing-based
positioning strategy to dynamically stabilize the sensorimotor activities generated by the
neocortex.
In this paper, we provide a geometric interpretation of the positioning/indexing strategy
universal and generalizable is to explicitly learn a nonlinear mapping that connects the local
geometry with the global topology. Since we are not interested in manifold reconstruction
(e.g., the dark areas of the world map), an indexing or positioning strategy is sufficient for
“navigating” from one place to another. This intuition is consistent with “using the world as
its own model” (1) but we are taking one step forward by noting that we only need a localized
version of the world model due to the locality constraint of our sensory organs. Using the
map analogy from the thousand brains theory (14), we need a librarian (i.e., the hippocampus)
to maintain efficient indexing and retrieval of the 150,000 maps generated by the neocortex. This
librarian can easily manage more maps as intelligence evolves: the neocortex has a
virtually infinite capacity thanks to its small-world network organization (8), while the network
organization of the hippocampus does not change much from other mammals to humans.
Toy Example
To illustrate our core idea, we introduce a toy example that marries
place cells, as discovered in the hippocampus (15), with the two-spiral dataset widely used
in the manifold learning literature (13), as shown in Fig. 1. Even though the maze experiments
used in rodent studies have not adopted such a spiral setting (16), we think it is plausible
to assume that mammals such as rodents can easily manage a maze with a spiral
shape. The deeper implication of this toy example is that if place cells are interpreted as spatial
context, we can generalize this hypothesized experiment to higher dimensions. For example, if
object recognition is formulated as a maze problem in a higher-dimensional space, how can we
design an intelligent machine for navigation?
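The toy setting can be made concrete with a minimal sketch. Everything below is an illustrative assumption rather than the construction used in the paper: the spiral parameterization, the Gaussian tuning curves, the tuning width, and the choice of every tenth sample as a place-cell center are all hypothetical. The sketch only shows that a position on the spiral "maze" can be indexed by the most active place cell, without reconstructing the manifold.

```python
import math

def two_spirals(n_per_arm=100, turns=1.5):
    """Sample two interleaved planar spirals (a hypothetical 'maze').

    Returns the points, the arm label of each point (0 or 1), and the
    normalized position t in (0, 1] of each point along its arm.
    """
    points, labels, arcs = [], [], []
    for arm in (0, 1):
        for i in range(n_per_arm):
            t = (i + 1) / n_per_arm
            angle = 2 * math.pi * turns * t + arm * math.pi
            points.append((t * math.cos(angle), t * math.sin(angle)))
            labels.append(arm)
            arcs.append(t)
    return points, labels, arcs

def place_cell_rates(x, centers, width=0.1):
    """Gaussian tuning curves (an assumption): a cell fires most near its center."""
    return [
        math.exp(-((x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2) / (2 * width ** 2))
        for c in centers
    ]

def position_index(x, centers):
    """Index of the most active place cell, used as the position code."""
    rates = place_cell_rates(x, centers)
    return max(range(len(centers)), key=rates.__getitem__)

points, labels, arcs = two_spirals()
# Assumption: every 10th sample serves as a place-cell center.
centers = points[::10]
center_labels = labels[::10]
center_arcs = arcs[::10]

# A slightly perturbed observation near sample 57 on arm 0.
query = (points[57][0] + 0.01, points[57][1] - 0.01)
k = position_index(query, centers)
print("arm:", center_labels[k], "position along arm:", center_arcs[k])
```

The lookup returns which arm the organism is on and roughly where along it, using only local distances to stored centers. No global embedding of the two spirals is ever computed.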
In conventional approaches such as IsoMAP (2) and locally linear embedding (LLE) (3), the
objective is to exploit the local subspace constraint of the manifold by nonlinear dimensionality
reduction. The drawback of this class of solutions is that they do not generalize well to arbitrary
manifolds, especially in higher dimensions. The fundamental reason behind this failure
can be explained by intelligence without representation (1): under the framework of embodied
cognition (17), perception and action have an equal footing. From the evolutionary perspective,
goal-oriented behavior is the foundation of natural intelligence. Therefore, navigation on
the manifold is easier than reconstruction of the manifold; and nature prefers simple but not
simpler solutions. A moment of thought can tell that a place cell-based navigation strategy can
Figure 1: The place cells discovered in the hippocampus provide spatial context for the task of
navigation. Such contextual modeling can be exploited to navigate the local Euclidean space of
an arbitrary manifold by “memorizing” its position in the global coordinates.
We advocate a new conceptual framework called “localized embodied cognition” that keeps the
locality principle in mind from problem formulation to solution algorithms. We still use the world
“as its own model”, but localized embodied cognition never attempts to solve a global problem
locally. An organism needs to couple its sensory and motor systems to achieve various goals
but never stores knowledge about the global environment. Instead, the bodily interactions with
the environment (situatedness) are persistently maintained by the interaction of the neocortex
(a single book) with the hippocampus (a large library). The cognitive map in the hippocampus
offers a situational context to interpret the sensorimotor interaction occurring in the neocortex.
In this section, we first present computational models for characterizing indexing operation in
the hippocampus and sensorimotor interaction in the neocortex and then connect them with a
universal attention mechanism.
Hippocampal Indexing Theory. The key idea underlying hippocampal indexing theory (10,
11) is that the hippocampus is “functionally designed and anatomically situated” to handle the
diverse neocortical activities generated by sensorimotor interactions. By projecting back to the
neocortex, the hippocampus stores a collection of indexes that can serve as the context for the
storage and retrieval of episodic memory (18). Recent findings indicate a more complex inter-
action between hippocampal functioning and episodic memory, with an emphasis on the cog-
nitive map interpretation (19). This newer perspective considers the hippocampus as creating
spatial maps (cognitive maps (20)) for memory formation. The theory also recognizes the hip-
pocampus’s role in binding different elements of an experience, including time and space, into
a cohesive memory (21). Furthermore, the relationship between place cells in the hippocampus
and memory formation suggests that these cells may function both as part of a cognitive map
and as index cells for memory (22).
Neocortex as a Sensorimotor Machine. One of the biggest mysteries in studying cortical
columns is that they are structures without a function (23). Instead, several canonical microcircuits
such as the Douglas-Martin model (24) and the Haeusler-Maass model (25) have been studied for
predictive coding in the literature. These models have revealed a remarkable correspondence
between the microcircuitry of the cortical column and the connectivity required by predictive
coding. In the Douglas-Martin model (24), information flow through the cortical column is characterized
by a stereotypical pattern of fast excitation followed by slower and longer-lasting inhibition
in the cat visual system. The three neuronal populations receive thalamic drive and amplify
transient thalamic inputs to generate sustained activities while maintaining a balance between
excitation and inhibition. This model was further extended by Haeusler and Maass in (25), which
closely resembles the canonical circuits required by predictive coding. The feedforward predic-
tion errors from a lower cortical level arrive at granular layers and are passed to excitatory and
inhibitory interneurons in supragranular layers, encoding expectations. Meanwhile, the connec-
tions between excitatory and inhibitory neurons in supragranular layers enable deep pyramidal
which nicely fits the functional design of the neocortex (i.e., detecting regularity in stimuli by
sensorimotor interaction). To accommodate the constantly changing properties of the global
world model, an organism needs a cognitive map for navigation. This map can be mathemati-
cally abstracted by a nonlinear mapping between local subspace geometry as discovered by the
neocortex and global topology as maintained by the hippocampus. Instead of reconstructing
the entire manifold, we argue it is sufficient to retrieve its local projection based on the context
information. This nonlinear mapping is essentially an attention mechanism that can be implemented
by a sparse distributed memory (SDM) (26) (attention approximates SDM (27))
or an associative memory (e.g., the modern Hopfield network (28), which is intrinsically connected with
the transformer architecture (7)). SDM-based indexing/positioning supports continual learning (29).
Real-world map-based navigation serves as a perfect example of the plausibility of
this manifold positioning strategy. Why can we safely drive around using GPS without memo-
rizing the entire world? Because at any time and location, we only need access to a localized
version of the world map for navigation purposes. Note that there is no need to reconstruct a
map for an area if no one ever goes there, which is consistent with goal-oriented behavior in
natural intelligence.
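As a rough illustration of retrieval by attention, the sketch below performs one step of softmax attention over a small key-value memory. It is a minimal example under stated assumptions, not the model proposed here. The keys stand in for hippocampal context codes and the values for local map codes stored in the neocortex. The specific vectors and the inverse temperature `beta` are hypothetical. The softmax-weighted readout has the same form as the attention operation in the transformer and as the retrieval step of a modern Hopfield network.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, beta=8.0):
    """One attention step: weights = softmax(beta * <key, query>).

    Returns the weighted sum of the values and the attention weights.
    """
    scores = [beta * sum(k_i * q_i for k_i, q_i in zip(k, query)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# Hypothetical memory: three context codes (keys) indexing three local maps (values).
keys = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
values = [(10.0, 0.0), (0.0, 10.0), (5.0, 5.0)]

# A noisy version of the first context still retrieves the first local map.
noisy_query = (0.9, 0.2, -0.1)
retrieved, weights = attend(noisy_query, keys, values)
best = max(range(len(keys)), key=weights.__getitem__)
print("selected index:", best, "retrieved:", retrieved)
```

Only the entry selected by the context contributes appreciably to the output. This matches the claim that the local projection can be retrieved without reconstructing the whole manifold.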
In this section, we revisit four widely studied cognitive tasks from the new perspective of lo-
cally navigating on a globally unknown manifold. The unifying theme is that positioning is
all you need to understand how the human brain solves these problems by a universal cortical
processing algorithm, as envisaged by Vernon Mountcastle in 1957 (30).
Spatial Navigation. Cognitive maps organize knowledge to support flexible and adaptive be-
havior. Among various behavioral tasks, spatial navigation is among the most well-studied in
the literature (19, 20). How do mammalian brains encode 2D physical space information into
a 1D spike sequence? This question is at the core of understanding the relationship between
space and time in its simplest form. The representation of spatial information (Euclidean or
non-Euclidean) plays an essential role in the building of cognitive maps. A graph-based rep-
resentation has been widely adopted for encoding the physical space into action-related state
space. A fundamental difficulty is the aliasing problem: identical sensory observations
can occur in different spatial locations (20). The aliasing problem is solved by attention in the
hippocampus, where the encoding of spatio-temporal context (index to the local map or position
on the global map) is handled by the CA3-CA1 collateral system (p. 326 of (8)). The consolidation
of semantic memory (analogy to global manifold topology) from episodic memory (analogy to
Motor Control. Motor control refers to the process of physically interacting with objects
in the world and manipulating them toward completing a specific task/goal (e.g., reach and
grasp). In motor control and trajectory planning (31), internal models consist of: 1) a forward
component responsible for predicting sensory consequences of the executed motor commands;
and 2) an inverse dynamics model that calculates forward motor commands from the desired
a water bottle differs from that of grasping a coffee cup at the surface; but the principle of
motor commands is the same except for the change of coordinates (from egocentric for spatial
navigation to allocentric for motor control). Using the library as the metaphor again, what the
hippocampus needs to do is to reposition the neocortex, i.e., move from one book (grasping
without a handle) to another (grasping with a handle). Unlike navigation, object manipulation does not
require path integration for trajectory planning because the scale of motor control is the same
Object Detection. In the spatial navigation of rodents, the environment or place learning
is achieved by ego-motion of animals, which dynamically changes the bearing and sketch
maps (33). Such behavior becomes more sophisticated as the evolution of the visual cortex
helps locate salient objects (a.k.a. landmarks) in the environment (34). From a manifold po-
sitioning perspective, we argue that there is an appealing analogy between the integrated map
for spatial navigation and the shape description for animate vision (4). That is, the problem of
object/landmark detection can also be solved by the inferior temporal (IT) cortex (35) assuming
that the spatio-temporal context is provided by the hippocampus. Mathematically, a place cell
can be abstracted by a Dirac function δ(x, y) that only fires when the organism is in the neighborhood
of (x, y). By analogy, a gnostic cell (a.k.a. concept cell (36,37)) can fire if and only if an
object is detected by the neocortex in a latent space (e.g., generalized Hough transform (38)).
The spatial attention mechanism implemented by the gaze becomes critically important here
because it reduces the computational burden by indexical reference (4). More importantly, po-
sitioning of the gaze direction is naturally required by the visual cortex to recognize an object
stitution between language and action in the theory of embodied language (40). By restoring
the central role played by the motor system in language processing, we can gain deeper insight
into the essence of language from a motor resonance or action simulation perspective (41). The
class of referential motor resonance is the source of abstract thoughts or covert movements in
our mental world, which are hierarchically organized. The relationship between hierarchical
motor control (42) and the nested structure of natural languages (43) can be better understood
from a hierarchical extension of the attention perspective. Both language and motor control
demonstrate compositionality, whereby complex structures are built from simpler elements ac-
cording to combinatorial rules. In natural language, sentences are composed of phrases com-
posed of smaller units such as words and morphemes. This syntactic structure manifests as the
ability to generate an infinite variety of sentences from a finite set of words and grammatical
rules. Similarly, in motor control, movements can be decomposed into smaller units, such as
motor primitives, which are organized hierarchically and combined in sequence
to produce more complex behaviors. The latest study relating the neural representation of the
hippocampal formation (e.g., Tolman-Eichenbaum Machine (19)) to the transformer (44) seems to
support that attention is needed for both spatial navigation and language comprehension. The
curious positional encoding of tokens in the transformer architecture, which uses sine and cosine functions,
appears to be a de-aliasing result that is conceptually similar to the omnidirectional place cells
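For reference, the sine and cosine position formula of the original transformer can be sketched as follows. The formula is the standard sinusoidal encoding. The small model dimension `d_model=8` and the range of 50 positions are illustrative assumptions. The check at the end shows the de-aliasing property informally: identical tokens placed at different positions receive distinct position codes.

```python
import math

def positional_encoding(pos, d_model=8, base=10000.0):
    """Sinusoidal positional encoding for one position.

    PE(pos, 2i)   = sin(pos / base^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / base^(2i / d_model))
    Assumes d_model is even.
    """
    encoding = []
    for even_index in range(0, d_model, 2):
        angle = pos / (base ** (even_index / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding

# Position codes for the first 50 positions (illustrative range).
codes = [tuple(positional_encoding(p)) for p in range(50)]

# No two positions share a code, so a repeated token is never aliased.
all_distinct = len(set(codes)) == len(codes)
print("all position codes distinct:", all_distinct)
```

Each pair of sine and cosine channels oscillates at a different spatial frequency, so the combined code stays unique over a long range even though every individual channel is periodic.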
8. G. Buzsáki, Rhythms of the Brain (Oxford University Press, 2006).
14. J. Hawkins, A Thousand Brains: A New Theory of Intelligence (Hachette UK, 2021).
15. J. O’Keefe, L. Nadel, The Hippocampus as a Cognitive Map (Oxford University Press, 1978).
22. M.-B. Moser, D. C. Rowland, E. I. Moser, Cold Spring Harbor Perspectives in Biology 7,
a021808 (2015).
23. J. C. Horton, D. L. Adams, Philosophical Transactions of the Royal Society B: Biological
Sciences 360, 837 (2005).
27. T. Bricken, C. Pehlevan, Advances in Neural Information Processing Systems 34, 15301
(2021).
(2023).
36. R. Q. Quiroga, L. Reddy, G. Kreiman, C. Koch, I. Fried, Nature 435, 1102 (2005).
37. R. Q. Quiroga, G. Kreiman, C. Koch, I. Fried, Trends in Cognitive Sciences 12, 87 (2008).
39. D. H. Ballard, IJCAI (Citeseer, 1989), vol. 89, pp. 1635–1641.
40. M. H. Fischer, R. A. Zwaan, Quarterly Journal of Experimental Psychology 61, 825 (2008).