Many of us have the impression that gestures and exaggerated facial expressions help us understand others in loud environments such as bars or conference parties. Speech perception in many contexts is multimodal – we incorporate visual information when trying to make sense of what someone else is saying.
These cues are so consistently predictive of each other that they have spawned an entire YouTube cottage industry around the “McGurk effect.” The McGurk effect refers to the fact that we categorize speech sounds differently depending on the visual input we receive along with the sounds we hear. Check it out in this video:
The integration of multiple cues at once is critical for dealing with perceptual input. For example, when listening to speech, we incorporate tone (pitch), loudness (intensity), and how long a word unfolds over time (duration) when trying to understand what another person means, a topic we have discussed previously. Pitch and visual processing are especially intertwined, as seen in all the metaphors we have for intonation – someone speaks with a “rising” tone, we talk about “high” frequencies, someone can have a “high” voice. Speakers of many languages across the world appear to conceptualize pitch this way.
Even though pitch and visual processing are intertwined, there nonetheless appear to be large individual differences in how easily people process pitch. Musicians, for example, may have what is often called “perfect pitch”, namely the ability to identify a musical note without the benefit of a reference tone. But even novices can identify tones that do not fit with the context, or that sound off-key. You can click here to test your pitch perception abilities.
Just as we might depend on extra cues such as gestures in acoustically difficult situations like conference parties, people who do not process pitch perfectly may benefit from additional cues in other modalities when a pitch task is difficult for them. The idea that combining cues improves performance most when no single cue is sufficient on its own (e.g., a gesture-sound combination) is linked to the Principle of Inverse Effectiveness (PoIE). So individuals who cannot rely on a single cue may come to rely on both.
Cecelie Møller and colleagues, in a study just out in the Psychonomic Society’s journal Attention, Perception, & Psychophysics, tested whether performance in discriminating pitch could be improved by adding a corresponding visual change that is predictive of a sound’s pitch. The authors were especially interested to see whether people with worse auditory-based pitch discrimination would experience a larger benefit than people with good pitch discrimination, in keeping with the PoIE.
The authors used a novel pitch discrimination task that combines visual cues with auditory ones. They then correlated performance across the conditions with participants’ baseline pitch discrimination aptitude. Their pitch discrimination task was a standard oddball detection task: participants heard tones in a sequence like A-A-A-A-B-A-A and were instructed to press the space bar as quickly as possible when they detected a change from A to B.
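A tone sequence of the kind described above could be sketched like this (the sequence length, tone labels, and oddball position here are placeholders for illustration, not the study’s exact design):

```python
import random

def make_oddball_sequence(n_tones=8, standard="A", oddball="B"):
    """Build a tone sequence with a single oddball at a random,
    non-initial position. Parameters are illustrative placeholders."""
    seq = [standard] * n_tones
    seq[random.randint(1, n_tones - 1)] = oddball  # randint bounds are inclusive
    return seq

print(make_oddball_sequence())  # e.g. ['A', 'A', 'B', 'A', 'A', 'A', 'A', 'A']
```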
In their experiment, changes in pitch were of one of two magnitudes, smaller (20 cents) or larger (30 cents), and were either higher or lower in pitch. While pitch is usually measured in hertz (Hz), cents are a measure of relative pitch: the cent distance between two notes, such as A and B, is a logarithmic transformation of the ratio of their frequencies (100 cents corresponds to one equal-tempered semitone). People with poor discrimination tended to be unable to tell apart 20-cent differences, while people with excellent discrimination almost always identified 30-cent differences.
In addition to the auditory information, visual cues were added during the task in the form of a circle at the center of the screen. In keeping with previous sound symbolism work, the pitches of the notes sometimes corresponded to the location of the circle on the screen – the circle would appear 0.5 degrees of visual angle higher or lower relative to the center of the screen. The circle’s movement did not always accompany a pitch shift, so participants could not use it as their sole cue to whether the note had changed from the previous one. Furthermore, the direction the circle moved could be congruent or incongruent with pitch metaphors. In the congruent conditions, higher pitches occurred with higher circles and lower pitches with lower circles; in the incongruent conditions, higher pitches occurred with lower circles and lower pitches with higher circles. An example of a sequence of trials is shown in the figure below. Each of the arrows corresponds to a sound being played. The MC and MmC conditions occurred when the circle also moved, either congruently or incongruently with the pitch (MC standing for matching cue and MmC for mismatching cue).
Two to four weeks later, participants took part in an assessment of their ability to distinguish between pitches, similar to the quiz linked to above. The difference between the pitches was adapted to each participant’s aptitude, since larger pitch differences are easier to identify.
Using a staircase procedure, participants received increasingly difficult comparisons between tones until they converged on a 70.7% performance level. Participants completed the staircase task 14 times, and their threshold – the pitch difference at which they maintained that performance level – was the average of the last 6 staircases. This provided a measure of individual differences in pitch processing, which the authors correlated with the effectiveness of the visual cues on pitch discrimination.
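A 70.7% target is the convergence point of the classic 2-down-1-up transformed staircase (make the task harder after two consecutive correct answers, easier after one error). A minimal sketch, assuming multiplicative step sizes and a toy deterministic listener – the starting level, step factor, and trial count are illustrative, not the study’s parameters:

```python
def run_staircase(respond, start=50.0, factor=0.8, n_trials=40):
    """Sketch of a 2-down-1-up staircase, which converges on the
    ~70.7%-correct level. `respond(level)` returns True for a correct
    discrimination at that pitch difference (in cents)."""
    level, streak, direction, reversals = start, 0, None, []
    for _ in range(n_trials):
        if respond(level):
            streak += 1
            if streak == 2:          # two correct in a row -> harder
                streak = 0
                if direction == "up":
                    reversals.append(level)
                direction = "down"
                level *= factor      # smaller difference = harder
        else:                        # one error -> easier
            streak = 0
            if direction == "down":
                reversals.append(level)
            direction = "up"
            level /= factor
    return level, reversals

# toy deterministic listener: correct whenever the difference exceeds
# 10 cents, so the staircase should settle near 10 cents
final_level, reversals = run_staircase(lambda diff: diff > 10.0)
print(round(final_level, 2), len(reversals))
```

Averaging the last few reversal levels (rather than the final level) is the usual way to estimate the threshold from such a run.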
To quantify participants’ pitch discrimination in the experiment, the authors calculated d’ scores, a measure comparing how often participants identified tone changes when they were present (hits) with how often they reported changes when there were none (false alarms). Any keypress made between 200 ms and 1000 ms after tone onset counted as a response to that tone. There were three questions of primary interest: whether the visual cues helped (bimodal compatibility gain), whether congruency (up meaning higher pitch and down meaning lower pitch) improved performance, and whether participants with weaker pitch processing benefitted more from the visual cues, in keeping with the Principle of Inverse Effectiveness.
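In signal detection terms, d’ is the difference between the z-transformed hit rate and false-alarm rate. A minimal sketch – the response counts and the rate correction below are assumptions for illustration, not values or methods taken from the paper:

```python
from statistics import NormalDist

def d_prime(n_hits, n_signal, n_fa, n_noise):
    """d' = z(hit rate) - z(false-alarm rate). A simple correction keeps
    the rates off 0 and 1 so the z-transform stays finite (the paper may
    handle extreme rates differently)."""
    hit_rate = (n_hits + 0.5) / (n_signal + 1)
    fa_rate = (n_fa + 0.5) / (n_noise + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# hypothetical counts: 32 of 40 oddballs detected, 8 false alarms on 160 standards
print(round(d_prime(32, 40, 8, 160), 2))
```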
Møller and colleagues found that the greater the pitch difference a person needed in the pitch discrimination assessment, the more the visual cues helped in identifying the pitch oddballs (reproduced below). Visual cues appeared to help everyone, though, with congruent trials helping more than incongruent trials.
Altogether, this study provides additional evidence that multimodal cues are useful in a novel context – the perception of pitch. Just like many other aspects of day-to-day perception, such as relying on how your keyboard feels when typing, or listening to and feeling yourself while speaking, the addition of visual cues appears to help people process auditory information, and this is especially true for the people who need the most help.
Psychonomic Society article featured in this post:
Møller, C., Højlund, A., Bærentsen, K. B., Hansen, N. C., Skewes, J. C., & Vuust, P. (2018). Visually induced gains in pitch discrimination: Linking audio-visual processing with auditory abilities. Attention, Perception, & Psychophysics. doi: 10.3758/s13414-017-1481-8