It all comes down to objects. How does the visual system manage to establish and maintain representations of objects in the world, despite almost constant change in input with shifts of the head and eyes, change in the urgency of different goals, and change in the objects themselves? It’s crazy, really, that it can be done at all.
Anne Treisman’s early papers on Feature Integration Theory (FIT), in the 1980s and 90s, were perhaps the first to explicitly articulate one of the fundamental challenges of object representation – the binding problem – and to suggest that its solution lay in spatial indexing via attention. Given that the front end of the visual system consists of a set of image filters by which feature content is represented across retinal space, how does the system solve the inverse problem of determining which features were derived from which objects in the world? That is (one articulation of) the binding problem, and the solution offered by FIT was that, because feature extraction is spatial, spatial indices can be used to bind features by selecting (i.e., attending to) a given set of locations and establishing an object representation consisting of all of the features at the selected locations. Under this view, feature integration is object representation.
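To make that idea concrete, here is a minimal sketch of binding-by-location: independent feature maps share a spatial index, and attending to a location bundles whatever feature values sit there into one object representation. This is my illustration, not Treisman’s formalism; the map contents and the `attend` function are made up for the example.

```python
# Minimal sketch of FIT-style binding by spatial indexing (illustrative only).
# Each feature dimension is a separate map over the same spatial coordinates;
# attention selects a location, and the object representation is simply the
# bundle of feature values found at that location.

color_map = {(0, 0): "red", (1, 0): "green", (2, 0): "blue"}
orientation_map = {(0, 0): "vertical", (1, 0): "horizontal", (2, 0): "oblique"}

def attend(location):
    """Bind features by reading every feature map at the attended location."""
    return {
        "location": location,
        "color": color_map[location],
        "orientation": orientation_map[location],
    }

print(attend((1, 0)))
# {'location': (1, 0), 'color': 'green', 'orientation': 'horizontal'}
```

The point of the toy version is that binding requires no machinery beyond a shared spatial index: the object representation just is the co-located feature values.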
The focus on visual search in the decades following the introduction of FIT de-emphasized object representation and instead produced an explosion of questions about how attention is guided within scenes, how bottom-up attributes of stimuli and top-down goals of the observer combine to determine what is attended, which visual attributes are represented “preattentively” and which are not, etc. – all fascinating questions that have led to important findings and new theoretical developments. But the focus on guidance and feature representation left FIT’s assertion that what attention does is bind features and establish object representations implicit or, in some cases, denied. Two papers in this special issue honoring Anne Treisman’s work have brought the question front and center, and have … well, captured my attention.
Evidence concerning the role of attention in binding and the representation of objects at the time FIT was first introduced came from illusory conjunctions. Sometimes people see things that aren’t there, and what they see is a useful window onto what the visual system does (or doesn’t do) with the information that it extracts through those image-feature filters mentioned above. Illusory conjunctions occur when individual features from different objects are representationally mis-combined and reported as a single object with those features, despite no such object being present. According to FIT, illusory conjunctions can occur among features that are derived from objects in locations outside the current focus of attention and are therefore “unbound”. When I say illusory conjunctions are perceptions of things that aren’t there, that is only sort of true. First, the features that are perceived were there, though they are perceived in combinations that weren’t there. Second, illusory conjunctions are invariably reported after the fact, so one can quibble about whether they are mis-perceptions or mis-memories, a quibble that, depending on what you are trying to explain, is more or less important. I’m going to set that quibble aside for this piece.
There are lots of open questions about illusory conjunctions, the answers to which will provide insight into the nature of object representations, how they are established, and how they are maintained. Confirming the functional success of the visual system, however, illusory conjunctions require some effort to induce and measure. They don’t tend to occur under normal circumstances, which is a good thing functionally, but a bad thing empirically. Early strategies for inducing illusory conjunctions used dual-task designs in which a difficult visual discrimination was required for a subset of stimuli and, afterwards, other stimuli that had been present but were irrelevant to the primary task were reported. When misbindings were reported, the result was consistent with FIT, but when they weren’t, the null result could be attributed to the dual task failing to fully occupy attention on the primary stimuli. This one-way logic is a limitation of the approach.
One of the papers in this special issue, by Vul, Rieth, Lew, and Rich, took an alternative approach to inducing and measuring illusory conjunctions. They presented dense arrays of stimuli that exceeded the limited spatial precision of selection mechanisms, thereby allowing an assessment of the perception of unselected stimuli within a single-task design. Taking this approach, they went beyond testing whether illusory conjunctions occur and, using multi-part objects, entertained a set of four specific hypotheses about different ways in which (mis)binding of features might occur. Each hypothesis predicted a specific pattern of joint reports of features from surrounding multi-part objects when the target object was too close to the others to be reliably selected. The upshot of the paper is that parts of objects tend to be reliably bound to specific features whether individually attended or not, but when selection fails to isolate an individual object, parts of objects (each including multiple, correctly bound features) are sometimes misbound with correctly bound parts of other objects to form representations of complex objects that were not presented, i.e., illusory conjunctions. This is a more complex, and no doubt still incomplete, understanding of the role of attention in establishing and maintaining object representations than is implicitly assumed in map-based models of visual search and attention.
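A toy simulation may help convey what that hierarchical pattern looks like. This is my reading of the result, not Vul and colleagues’ model: each part of an object keeps its own correctly bound features, but when selection fails to isolate a single object, whole parts can migrate in from a crowded neighbor. The object structures and the swap probability are invented for the illustration.

```python
import random

# Toy simulation of hierarchical misbinding (illustrative, not the authors'
# model): features never migrate individually here; only whole, internally
# bound parts do, and only when selection fails to isolate the target.

target =   {"top": {"color": "red",  "shape": "arc"},
            "bottom": {"color": "red",  "shape": "bar"}}
neighbor = {"top": {"color": "blue", "shape": "dot"},
            "bottom": {"color": "blue", "shape": "bar"}}

def report(target, neighbor, selection_isolates_object, swap_prob=0.3):
    """Report a whole object; parts may be swapped in from the neighbor."""
    if selection_isolates_object:
        return dict(target)
    return {part: (neighbor[part] if random.random() < swap_prob else target[part])
            for part in target}

random.seed(1)
print(report(target, neighbor, selection_isolates_object=False))
# May print an object that was never presented, e.g. the neighbor's top
# part (correctly bound blue dot) atop the target's bottom part (red bar).
```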
Missing from all of this discussion is the nature of the space we are talking about, which brings me to the second paper in this special issue that captured my attention, by Dowd and Golomb. According to FIT, binding occurs essentially through the spatial indexing of features via spatial selection. The characterization of space in FIT, and indeed in all map-based models of attention, sidesteps the question of “What space?” There is a disconnect between the motivation for asserting a binding mechanism based on spatial indexing and the functional demands of object perception. Specifically, a starting point for FIT’s proposal of binding through spatial attention was the observation that front-end image filters are spatial filters, in that they are defined through neural receptive fields. They can therefore be understood as yielding spatially indexed representations of feature values that, when selectively attended, could provide representations of integrated objects. But the space in which those front-end filters operate is retinotopic, whereas the space in which we perceive and interact with objects is spatiotopic. As I move my eyes, head, and body, the representation of where objects are in the world is maintained despite dramatic changes in where (and how) they project to my retinae. Being functional models, feature maps in models like FIT and Guided Search are spatiotopic: they refer to locations within the display, not locations to which stimuli project on the retina. So…that’s a question. How is the retinotopic output of front-end image filters updated to accommodate eye, head, and object motion? One important observation concerning this question, since demonstrated in other cortical areas as well, is that the receptive fields of some neurons in parietal cortex shift in anticipation of an impending eye movement. Rather than reflecting a true predictive remapping of the visual field, however, more recent evidence indicates that these receptive-field shifts are generally toward the saccade target, from all directions, which is more like a focused-selection mechanism than a remapping mechanism.
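The bookkeeping at stake can be stated in one line: to a first approximation, a spatiotopic (display) position is the retinal position plus the current gaze position, so every saccade shifts every retinotopic coordinate even though nothing in the world moved. Here is a minimal sketch of that accounting, with made-up 1-D coordinates, ignoring head and body movement and rotation:

```python
# One-line version of the retinotopic/spatiotopic bookkeeping (a first
# approximation that ignores head/body movement and rotation):
#   spatiotopic = retinotopic + gaze
# After a saccade, the world stays put, so every retinotopic coordinate
# must shift by minus the saccade vector to keep the equation true.

def to_spatiotopic(retinal_pos, gaze_pos):
    return retinal_pos + gaze_pos

gaze = 0.0
object_world_pos = 5.0
retinal = object_world_pos - gaze   # object projects at 5.0 on the retina

gaze += 3.0                         # saccade of +3
retinal = object_world_pos - gaze   # now projects at 2.0 on the retina

assert to_spatiotopic(retinal, gaze) == object_world_pos  # world position unchanged
```

The hard empirical question is not the arithmetic, of course, but how and when the visual system performs the equivalent update across its retinotopic maps.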
There is still much to understand about how spatiotopic attention is guided on the basis of retinotopic output from early vision, and there is a large and active literature aimed at this question. But a paper in the current special issue demonstrates some intriguing aspects of feature binding and misbinding as a function of the retinotopic and spatiotopic locations of stimuli, identifying still greater (in my mind) complications to the story while at the same time pointing to at least a finger hold for linking retinotopic and spatiotopic contributions to object representation. Dowd and Golomb (2020) had subjects maintain fixation at one location while attention was cued to another. Subjects then had to shift their eyes, after which multiple colored, oriented bars were presented, with one in the original spatiotopic location of the cue (now a new retinotopic location) and another in the original retinotopic location (now a different spatiotopic location). The task was to report, on continuous scales, the color and orientation of the bar at the original spatiotopic location of the cue. The upshot of the paper is that misbindings (illusory conjunctions) occurred between the object at the spatiotopic location of the cue (i.e., the target) and the object at the retinotopic location of the cue. Analyses were offered to argue that these misbindings reflected brief simultaneous attention to both the spatiotopic and the retinotopic locations, rather than a sluggish remapping process. If this conclusion holds, it is an opening for linking map theories of attention and object representations, which are spatiotopic, with the retinotopic origin of those representations.
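For readers unfamiliar with continuous-report designs, the logic of spotting such retinotopic misbindings can be sketched simply. This is my illustration of the general approach, not Dowd and Golomb’s actual analysis; the classification criterion and values are invented, and the feature scale is assumed to be circular (in degrees):

```python
# Sketch of swap-error classification in continuous report (illustrative):
# a trial's report is classified by whichever item's feature value it falls
# closest to on a circular feature scale, within a criterion window.

def circular_dist(a, b):
    """Shortest angular distance between two values on a 360-degree scale."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def classify(report, target_value, retinotopic_value, criterion=30):
    if circular_dist(report, target_value) <= criterion:
        return "target"
    if circular_dist(report, retinotopic_value) <= criterion:
        return "retinotopic swap"  # feature of the item at the old retinal spot
    return "other/guess"

print(classify(report=172, target_value=20, retinotopic_value=180))
# -> "retinotopic swap": the report matches the retinotopic item, not the target
```

A pattern of such swaps on one feature dimension but not the other is what marks a misbinding rather than a wholesale mislocalization.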
It all comes down to objects. I perceive Anne Treisman’s incredible body of work on FIT and the object-file framework as being concerned, ultimately, with understanding how the visual system manages to establish and maintain representations of objects in the world, despite change in input, goals, and the objects themselves. I’ll say it again: it’s crazy, really, that it can be done at all. While FIT and the object-file framework constitute amazing and inspirational beginnings, there remain significant gaps in our understanding of how we get from spatially indexed feature maps to object files. But studies like those of Vul and colleagues and Dowd and Golomb, as well as others in this special issue, provide glimpses into how we might build on Treisman’s foundations and eventually figure it out.
Psychonomic Society articles featured in this post:
Dowd, E. W., & Golomb, J. D. (2020). The Binding Problem after an eye movement. Attention, Perception, & Psychophysics, 82, 168–180. https://doi.org/10.3758/s13414-019-01739-y
Vul, E., Rieth, C. A., Lew, T. F., & Rich, A. N. (2020). The structure of illusory conjunctions reveals hierarchical binding of multipart objects. Attention, Perception, & Psychophysics, 82, 550–563. https://doi.org/10.3758/s13414-019-01867-5