Can you hear that face? Cross-modal perceptions

We learn at a young age that we see with our eyes, hear with our ears, smell with our noses. The clear anatomical differences between these sense organs appear to map intuitively onto how we think about the different perceptual modalities. The scientific study of perception is likewise largely divided along these modal lines. Someone studying visual cognition might attend the vision sciences conference; someone studying audition, the meeting of the acoustical society; one interested in taste, “the international symposium on olfaction and taste”.

If sensory modalities were encapsulated modules—each working independently to provide us with information about sensory data in which they specialize—it would be fine that the people studying vision publish in different journals and go to different conferences from people studying audition.

But more and more, scientists are discovering that perception is inherently multimodal, and that focusing on studying sensory modalities independently may be hampering our progress in understanding perception itself—the process by which we derive meaning from sensory inputs.

In a new study published in the Psychonomic Bulletin & Review, Zweig, Suzuki, and Grabowecky from Northwestern University showed undergraduate students pictures of unfamiliar faces and trained them to associate each face with a voice saying things like “Hi I am Michael and I am 26.” The training continued until the face-voice pairings had been learned, and participants were then presented with a search task.

One of the learned faces was selected to serve as the target, which was shown along with 15 other faces (distractors) of the same gender. The participants’ task was to find the target face as quickly as possible, reporting the quadrant in which it appeared. Simultaneously with the appearance of the display, participants heard a voice say a phrase like “Hi, I’m over here.” This auditory information was not informative of the location of the face (the way it would be in the real world), but on a subset of the trials, the voice matched the face that it was previously paired with. Performance on these congruent trials was contrasted with performance on trials in which the voice was not previously learned, or in which the voice was previously learned but played in reverse, which served as a control for low-level acoustic information like pitch and duration. What happened? Participants were considerably faster (~13% speedup) in finding the face when hearing its matching voice than when hearing an unlearned voice.

Performance on the unlearned condition was similar to the voice-reversed condition. Subsequent follow-up experiments further ruled out effects of voice familiarity—showing that it was face-voice congruity rather than voice familiarity that was responsible for the search speedup, and that the search performance when hearing a matching voice was enhanced relative to not hearing any voice. Together, the results provide compelling evidence that hearing a newly learned congruent sound enables people to more effectively attend to the matching visual stimulus.

In a theoretically converging set of studies, also published in Psychonomic Bulletin & Review, Barenholtz, Lewkowicz, Davidson, and Lauren Mavica from Florida Atlantic University asked participants to learn which of four faces corresponded to a given voice (Experiment 1) or which of four dogs corresponded to a particular bark (Experiment 2).

In both experiments, learners were better at learning the pairings when the sound and visual stimulus matched on a categorical level. That is, people more readily learned which of four faces matched a given voice when the faces and voice were both female or both male. People likewise learned much better to associate a bark with a particular dog than with a particular bird. While the finding that some associations are learned more easily than others is not surprising, what is surprising is just how large a learning advantage is obtained when the visual and auditory stimuli mesh at a categorical level. This finding is expected on accounts positing that our representations of auditory and visual material are not altogether distinct.

Most work on cross-modal influences has focused on relationships between audition and vision. Many readers will be familiar with the finding that certain novel words (bouba/kiki) elicit particular visual forms (rounded and sharp shapes, respectively).[1]

A recent review paper, once again published in the Society’s Psychonomic Bulletin & Review, by Knöferle and Spence from Oxford University examined cross-modal associations of a less familiar sort; namely, between taste and sound. Multiple studies have now reported the existence of associations between, for example, sour tastes and high-pitched sounds, and sweetness and an even rhythm.

The authors discuss several mechanisms by which such associations may arise, including common intensity coding (high intensity stimuli in one modality become associated with high intensity stimuli in other modalities), hedonic matching (positive acoustic stimuli tend to be associated with positive tastes), and the possibility that the associations are mediated by facial expressions that are made in response to certain tastes and also when producing certain sounds. None of the proposed mechanisms fully accounts for the observed findings, but they do hint that metaphors such as “sweet melody” may be more than arbitrary linguistic conventions and may reflect similarities in perceptual representations between the modalities.

These results join a growing literature of surprising and counter-intuitive empirical findings demonstrating rich relationships between what are still thought by many to be parallel, independent modalities. It is tempting to think that as more researchers delve into understanding these relationships, the more intuitive their existence will become, and the closer we will get to a fuller understanding of how we can hear that face.


[1] The classic result is Köhler’s baluma/takete effect: people overwhelmingly match baluma to rounded shapes and takete to sharp-cornered shapes. This effect was published contemporaneously by Sapir in 1929 using the stimuli mil and mal, and was popularized by Ramachandran and Hubbard in 2001, who replicated it using bouba/kiki as the stimuli. More complex and subtle auditory-visual associations between acoustic pitch, intensity, and formant heights on the one hand, and visual size, lightness, angularity, and spatial location on the other, have since been demonstrated.
