Most objects that surround us seem familiar and are easily identifiable even when viewed out of the corner of our eye. We are so quick to identify objects that it almost seems trivial, but just like speech production, object recognition is quite complex. How is object recognition actually achieved?
Of course, knowing what objects tend to be where in our environment provides important first guesses. But when it comes to learning the identities of new objects, such general knowledge is of little help. Instead, mappings between objects and their identities need to be learned flexibly across viewpoints and locations; but how?
In a recent study published in Psychonomic Bulletin & Review, researchers Bowers, Vankov, and Ludwig tested people’s ability to recognize a novel object learned at one particular retinal location when that object was later projected to a new, untrained retinal location. Successful object recognition at the new location would require something called “translation tolerance.”
The authors discuss three hypotheses concerning translation tolerance. According to Hypothesis 1, tolerance is largely post-visual: it results from learning many different high-level representations of an object, one for each trained location, and linking these representations to a common post-visual code. This differs slightly from Hypothesis 2, according to which a given object has exactly one common high-level visual representation, but the mapping from each retinal location onto that representation needs to be learned in order to access it. Both hypotheses predict major difficulties when an object is projected to a novel retinal location that is distant from the trained locations.
To anticipate the main findings, Bowers and colleagues provide evidence against these two hypotheses and instead favor Hypothesis 3, according to which translation tolerance is achieved within the visual system itself: a single high-level visual representation is computed online, regardless of an object’s retinal location. This allows for successful object recognition even when the object is presented at retinal locations quite distant from the trained locations. The three hypotheses are summarized in the figure below.
To pit the hypotheses against each other, Bowers and colleagues created conditions in which (i) more extended sampling was possible than in previous studies; (ii) the items were more object-like; (iii) the objects differed from one another in more than some fine perceptual detail; and (iv) post-visual codes could not contribute to performance.
Participants were trained on a set of novel objects that differed in configural properties rather than in detailed features, as shown by the examples in the figure:
As the figure shows, the objects could not be identified or distinguished from one another on the basis of their parts; they had to be identified as complete objects.
During the experiment, participants first fixated a central cross, and the objects then appeared either to the left or to the right of the cross, accompanied by their auditorily presented letter names (“Q,” “V,” “C,” “S,” “D,” and “J”), which had to be learned.
Feedback was provided, and the training phase lasted until the participant had named the objects correctly on 24 consecutive trials. In the test phase, the objects’ position was varied: they were presented either at the trained location, at the opposite location, or at the center of the screen.
Results of Experiment 1 showed that participants were able to name the trained objects with more than 70% accuracy even when they were presented at novel positions, that is, both at the center and on the opposite side of the screen. These data clearly speak against Hypotheses 1 and 2. In two subsequent experiments, the authors assessed to what extent the modest reduction in performance at novel positions was due to the change in spatial location (Experiment 2) or in retinal location (Experiment 3).
All three experiments essentially converged on the same finding: After learning to name novel objects at one retinal location, this ability can be transferred to other retinal locations with a high degree of accuracy.
The findings of this study lend support to “online” theories, which propose that high-level object codes are abstracted away from retinal location, enabling widespread generalization after an object has been encountered at only a single location. While some costs are unavoidable when moving away from a learned location, the robust translation tolerance demonstrated across the three experiments highlights the importance of the processes underlying this tolerance, particularly for models of visual recognition. Such models tend to circumvent high-level abstraction and instead achieve tolerance by being trained with objects (or words) in every possible location.
So how robust are such high-level object codes? Could we test a Hypothesis 4, according to which visual object representations are abstracted to such a degree that they would allow transfer to, and identification in, other, non-visual modalities such as smell or sound?
This remains to be seen. For now, we might simply want to look at an ice cream cone and confirm that it tastes like one as well.