Remember answering machines? Me neither, but this is why Seinfeld exists. When Jerry’s girlfriend, Sophie, leaves a message without stating her name (just, “it’s me!”), Jerry decides to call her back without stating his own name.
Stripped of visual and contextual cues—Jerry’s face, what he’s wearing, his name—Sophie fails to identify him by just his voice, confusing him with Rafe (who already knows the tractor story)!
Identifying individuals by their voice is a difficult problem, and the Seinfeld clip illustrates two different aspects of it. The first is telling voices apart: Sophie cannot distinguish Jerry from Rafe and mistakes one for the other. The second is telling voices together. When Jerry disguises his voice to continue the ruse of being Rafe, Sophie presumes that she is still talking to Rafe, just Rafe with a cold. She accepts that Rafe and Rafe-with-a-cold are one person, treating two different-sounding voices as the same identity.
In a recent article published in Psychonomic Bulletin & Review, Nadine Lavan, Mike Burton, Sophie Scott, and Carolyn McGettigan discuss the understudied problem of telling voices together. Lavan and colleagues point out that the immense variability within individual human voices means the problem lies not just in telling Jerry and Rafe apart, but in realizing that Jerry’s voice is Jerry’s voice whether he is laughing, whispering, or even imitating Rafe.
Within-voice variability can indeed be striking. Hank Azaria is credited with voicing over 30 characters on The Simpsons. These characters do not sound alike. In fact, listen to (rather than watch) this clip of Azaria running through as many famous Simpsons characters as he can in 30 seconds:
Now try to imagine whether you would have guessed that all these voices came from the same person.
Within-voice variability like voice acting (or voice disguising) falls under what Lavan and colleagues call volitional variability. Most people speak differently when talking to a baby than to a colleague, or in a loud bar than in a quiet library. Within-voice variability can also be spontaneous: the result of being sick, going through puberty, or being in an emotionally heightened state.
Even outside voice acting, in which voice characteristics are changed dramatically and intentionally, normal within-voice variance is high. Lavan and colleagues illustrate this using the figure below, which shows the waveforms and spectrograms from screaming, speech, and laughter, revealing the readily apparent physical dissimilarity that characterizes these sounds.
Given such high variance across sounds that can all be made by the same individual, how do listeners manage to recognize a voice at all?
One property Lavan and colleagues identify as going a long way toward ameliorating the trouble with within-voice variability is familiarity. For example, when students in one of the studies reviewed by Lavan and colleagues were asked to sort voices by speaker, students who had never heard the voices before sorted them into 4-9 identities, whereas students who had heard the voices before created only 3-4 categories. Of course, even the familiarity effect was not perfect: in reality there were only two speakers in total.
Familiarity effects have also been found for language, which Lavan and colleagues argue is likely due to native speakers of a language picking up on key pronunciations that non-native speakers cannot. For example, the idiosyncratic way I say “poster” may identify me as a Philadelphian. A native speaker may pick up on the subtle difference between “pow-ster” and “poster,” but someone who doesn’t speak English well might miss the difference, and thus the cue to identity.
And just in case you are wondering whether you could pick a Philadelphian:
The general form of the recognition problem itself—how can we identify something that arises from vastly different sensory signals, across different contexts, times, and places—is not new. The visual system must solve this problem in identifying faces (and of course other objects, but faces have a similar property to voices in that they typically belong to one person).
Intriguingly, the literature on face identification shows that within-face variability can actually help a viewer recognize a face. That is, what makes your face unique is not just the features that stay the same (like the shape of your eyes or the color of your lips), but the way your features change across time and context. A similar benefit of variability has not yet been identified for vocal recognition, but Lavan and colleagues highlight this as a potentially useful avenue for future research.
The way forward may come from machine learning. As deep convolutional neural networks are having remarkable success at performing some aspects of object and scene recognition, they are also beginning to offer hypotheses about the common information contained in very different examples of the same category. Although, as Lavan and colleagues point out, we may not identify a singular “voice-print” in the form of a common signal across all vocal modulations, we may be able to start identifying potential sources of discriminability and testing whether those are what people actually use to identify voices.
We are a long way from “voice-prints,” and, in contrast to their skill at object recognition, humans are surprisingly bad at voice recognition. Perhaps this is because the auditory cues coming from voices were (in an evolutionary sense) always accompanied by the visual cues of the person speaking. Even in the Stone Age, it was rare for fog to be thick enough to obscure the identity of a person talking to you. Of course, in the modern age of phones (and in the slightly less contemporary epoch of answering machines), these sources of information are dissociated, making vocal identification a much more common problem. At least for Seinfeld, mistaken cases of “telling people together” provided plenty of good joke fodder (slightly NSFW).
Reference for the Psychonomics article discussed in this post:
Lavan, N., Burton, A.M., Scott, S.K., & McGettigan, C. (2018). Flexible Voices: Identity perception from variable vocal signals. Psychonomic Bulletin & Review. DOI:10.3758/s13423-018-1497-7.