News from the Cocktail Circuit: Extracting useful information from the din

You are deeply involved in a conversation with someone at a party when suddenly you hear someone say your name, and, before you even know what happened, your attention is transported toward the voice that uttered it. This is an example of the cocktail party effect, first described in 1959 and recently extended even to visual stimuli.

This effect was discovered as part of a series of studies, started in the early 1950s and continuing to the present, aimed at understanding how people attend to specific information in noisy environments. For example, how do people selectively attend to their conversational partners while filtering out all the other conversations and ambient noise? A specific problem these early researchers were trying to solve was how to help air traffic controllers attend to transmissions from a particular pilot without mixing them up with what other pilots were saying at the same time. This general line of research came to be known as the “cocktail party problem” (a cocktail party would have been a more familiar environment than an airport control tower)—and is reviewed at length in a newly published article in Attention, Perception, & Psychophysics by Adelbert Bronkhorst.

Bronkhorst notes that “if engineers would design an acoustic signal for communication among humans, resistant to all kinds of acoustic interferences, they would probably not come up with something resembling natural speech.” And yet, speech turns out to have some rather remarkable features. Its acoustic properties make it quite resistant to common sources of environmental noise. This is one reason why it is relatively easy to have a conversation on a busy city street—a rather chaotic acoustic environment.

Speech is also highly redundant. This redundancy means that a listener can attend to a given speaker in a variety of ways: by attending to a location (people, like many other animals, are very adept at localizing where sounds are coming from, and at attending to those locations), to a specific voice, or even to specific content. Using this information helps group together utterances by the speaker or speakers one is trying to attend to, while filtering out others. One general finding from this literature is that location beats voice for grouping: it is easier to selectively attend to speech coming from a particular location, even if the speaker occasionally switches (e.g., when eavesdropping on a conversation), than to selectively attend to a particular voice as it switches from one location to another.

The redundancy of speech is also apparent in its resistance to various kinds of distortion. In fact, one can remove just about every cue that normally distinguishes speech sounds from one another and still make out what someone is saying. This is especially true if you have some expectation of what you are about to hear. A fabulous demonstration is “sine-wave speech”: speech filtered to remove the normal phonetic cues. Some demonstrations can be found here. First, listen to the sine-wave sound clip (labeled “SWS”), then listen to the original sentence. Now listen to the SWS clip again. Most people experience a radical transformation in which the originally nonsensical series of sounds “snaps into” easily understandable speech—a powerful demonstration of how what one hears is shaped by one’s knowledge and expectations.
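The idea behind sine-wave speech is simple to state: estimate the center frequencies of the first few formants over time, then replace the entire signal with a handful of pure tones that follow those frequency tracks. A minimal sketch of that synthesis step is below; the formant tracks here are made-up numbers for illustration, whereas real demonstrations extract them from recordings (typically via linear-predictive analysis):

```python
import numpy as np

def sine_wave_speech(formant_tracks, sample_rate=16000, frame_duration=0.01):
    """Synthesize sine-wave 'speech' from formant frequency tracks.

    formant_tracks: list of arrays, one per formant, giving the formant's
    center frequency (Hz) in each 10-ms frame; 0 marks silence in a frame.
    Returns a mono signal normalized to the [-1, 1] range.
    """
    n_frames = len(formant_tracks[0])
    spf = int(sample_rate * frame_duration)  # samples per frame
    out = np.zeros(n_frames * spf)
    for track in formant_tracks:
        phase = 0.0  # carry phase across frames so the tone stays continuous
        for i, freq in enumerate(track):
            if freq > 0:
                t = np.arange(spf)
                out[i * spf:(i + 1) * spf] += np.sin(
                    phase + 2 * np.pi * freq * t / sample_rate)
                phase += 2 * np.pi * freq * spf / sample_rate
    return out / max(len(formant_tracks), 1)

# Hypothetical tracks for two formants over three frames (last frame silent):
tracks = [np.array([500.0, 520.0, 0.0]), np.array([1500.0, 1480.0, 0.0])]
signal = sine_wave_speech(tracks)
```

Summed together, the tones preserve the coarse time-varying spectral shape of the original utterance while discarding the fine phonetic detail, which is why the result sounds like whistles until expectation reorganizes it into speech.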

Further redundancy in speech comes from our knowledge of the language (as well as general world knowledge). In one study cited by Bronkhorst, words missing from short everyday sentences could be filled in with 70–90% accuracy. Of course, sometimes this process of filling in goes awry. For years I thought the Beatles were singing about Amanda, a pretty nice girl who doesn’t have a lot to say. I wasn’t the only one.

One often-used method for studying which cues people use when attending to a given speaker is the Coordinate Response Measure task. In this task, people hear commands such as “Ready Baron, go to green four now.” Listeners are instructed to attend only to a particular call sign (e.g., “Baron”), which identifies the target speaker, while ignoring other, non-target speakers, and then to execute the target speaker’s command. By presenting listeners with multiple speakers at the same time, it is possible to explore what factors lead to errors. Typically, people find it easy to hear their call sign, but then often proceed to execute the instruction spoken by a non-target speaker. Performance improves drastically if the target call sign is always heard from a particular location; by contrast, errors increase if the target and non-target speakers are of the same gender. Interestingly, this problem is substantially reduced if the target speaker’s voice is familiar—for example, the listener’s spouse.

Returning to the cocktail party effect, one of its most fascinating aspects is that in order to recognize one’s name in an unattended conversation, a listener must be processing all the speech around them at least to a degree that allows distinguishing one’s name from all the other noises and words in the background. This is similar in some respects to ignoring all kinds of noises when asleep, but waking up immediately on hearing an alarm one is used to.

Interestingly, in a typical experimental context used to test for the cocktail-party effect, only about a third of participants detect their name in an unattended conversation. Some work suggests that the likelihood of noticing one’s name is predicted by the listener’s working-memory capacity, as reported in a 2001 study published in Psychonomic Bulletin & Review: people with lower working-memory capacity may have difficulty inhibiting distracting information. It is interesting to speculate about other sources of individual differences: one can imagine that a person with an unusual name would be more likely to show the effect than someone with a common name.

Certain kinds of musical training may also prove relevant. A 2011 Radiolab episode—“A 4-Track Mind”—profiled a ragtime pianist, Bob Milne, who can individually attend to four symphonies played simultaneously, and can imagine multiple musical compositions at once, starting and stopping each one on command. Collecting data from much larger and more varied populations than is typical in psychology experiments—something crowdsourcing makes easier—may reveal further surprises about what is typical, and what is possible.

The use of crowdsourcing to discover individual differences easily missed in the lab will be the subject of my next post.
