Who’s having a party next door? Hearing but not counting voices

There is a party in the hotel room next to you. You are beginning to worry about your 8 am keynote address the next morning, and you’d love to tell those folks to quieten down. But how many of them are there? Just 2 or 3? Or more? What if it’s a whole troupe of Olympic weight lifters from Bulgaria? Should you call hotel security or politely ask for quiet yourself?

While you ponder this question you begin to think of the cognitive evidence relevant to your situation.

A recent article in the Psychonomic Society’s journal Attention, Perception, & Psychophysics addressed your party dilemma in the laboratory. Researchers Kawashima and Sato presented participants with simultaneous speech from a number of talkers and asked for a numerosity judgment—how many talkers were there?

Judgments of numerosity, that is, judgments of how many of a kind there are, have long interested cognitive psychologists and other scientists. Intriguingly, many animals seem to have a sense of number: Rats can be trained to press a bar either 8 or 16 times for a reward. How hungry they are determines how fast they press the bar, but not how often, which indicates that they have a sense of the number of presses required to obtain the reward. And outside the lab, lions in the wild have a sense of numerosity, as revealed by the fact that adult females are more reluctant to approach a group of 3 female intruders than a single intruder.

In humans, numerosity judgments are typically studied in the visual domain: People are presented with a varying number of objects and have to indicate their number as quickly and as accurately as possible. Those studies typically find that people can enumerate between 1 and 4 items at great speed and with great accuracy and confidence, an ability known as subitizing. Any number greater than that requires considerable additional time, suggesting that people may switch from immediate perception of numerosity for a small set of objects to counting for a larger set.

By contrast, very little is known about the perception of numerosity of auditory stimuli, which renders the study by Kawashima and Sato particularly intriguing.

In their first experiment, Kawashima and Sato presented anywhere between 1 and 10 talkers simultaneously who spoke for just under a second or 5 seconds, depending on condition. The participants’ task was to determine the number of talkers and report their number aloud. In this study, all talkers (i.e., humans who were speaking) were presented using a single speaker (i.e., the device that emitted the sound).

The data are shown in the figure below, which plots participants’ judgment as a function of the actual number of talkers present. The blue line shows the data for speech of 5 seconds and the black line is for the brief presentation duration. The dotted diagonal line indicates what perfect performance would look like—it connects any number N of actual talkers to a judgment of exactly N talkers being perceived.

It is immediately obvious that people’s judgments mirror the correct number of talkers only at the lower end of the scale: When 1, 2, or 3 talkers were presented simultaneously, participants’ judgments followed suit. Beyond 3 talkers, however, judged numerosity fell dramatically behind the true number.

Participants performed slightly better when the talkers were of mixed gender rather than all male or all female, but those differences were not particularly large.

In a second experiment, Kawashima and Sato extended the duration of speech and additionally introduced spatial separation between talkers. That is, whereas in the first experiment there was a single physical speaker, in the second study between 1 and 6 people spoke simultaneously, but each talker was delivered by its own physical speaker. Speakers were spatially separated from each other.

The main results were the same as before: Participants had difficulty judging the number of more than about 3 talkers, and their accuracy improved the more time they had overall to listen to the speech. In addition, people were found to be more accurate when the talkers were more widely separated in space: The different talkers were either presented through speakers that were close together or further apart. The latter was found to support better performance, a finding that was confirmed in a third experiment.

Why was this effect observed? What is it that makes it difficult for us to judge the number of voices when there are more than just 2 or 3 talkers? One possibility is that the simultaneous presence of multiple voices creates a cacophony that is difficult to analyze. That is, it may just be that the signal-to-noise ratio declines rapidly as more talkers chat away. An alternative possibility is that people can discriminate the voices well but have difficulty enumerating them. In other words, the problem may lie in the counting, not the perception, of different voices.

To disentangle those two possibilities, Kawashima and Sato conducted a fourth experiment in which they manipulated not only the number of talkers, but also the number of additional “speech-like noises.” A speech-like noise is a form of random noise whose power spectrum is equal to the long-term average spectrum of each talker and whose amplitude follows the contour of a stimulus voice. In other words, speech-like noises are not voices but on average their acoustic energy is equal to that of a human voice.
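For readers who like to see such a definition spelled out, here is a minimal sketch of how a speech-like noise could be built. It is an illustration under stated assumptions, not the procedure Kawashima and Sato used: the "voice" is a hypothetical toy signal (silence followed by a two-tone burst), the spectrum is matched by randomizing the phases of that signal's Fourier transform, and the amplitude contour is approximated by a moving RMS envelope. All function names are made up for this example. Note that imposing the envelope after matching the spectrum smears the spectrum slightly, so the match is only approximate.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform, O(n^2); adequate for a short demo."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT of a conjugate-symmetric spectrum, returned as real samples."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def phase_randomized(x, rng):
    """Noise with the same magnitude spectrum as x but random phases."""
    n = len(x)
    spec = dft(x)
    out = [0j] * n
    out[0] = complex(spec[0].real, 0.0)  # DC bin must stay real
    for k in range(1, (n + 1) // 2):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        out[k] = abs(spec[k]) * cmath.exp(1j * phase)
        out[n - k] = out[k].conjugate()  # symmetry keeps the signal real
    if n % 2 == 0:
        out[n // 2] = complex(spec[n // 2].real, 0.0)  # Nyquist bin stays real
    return idft_real(out)


def envelope(x, window):
    """Moving RMS, used here as a simple stand-in for the amplitude contour."""
    n = len(x)
    half = window // 2
    env = []
    for t in range(n):
        seg = x[max(0, t - half):min(n, t + half + 1)]
        env.append(math.sqrt(sum(v * v for v in seg) / len(seg)))
    return env


def speech_like_noise(voice, rng, window=16):
    """Spectrum-matched noise whose loudness follows the voice's envelope."""
    noise = phase_randomized(voice, rng)
    env = envelope(voice, window)
    mean_env = sum(env) / len(env)
    return [s * e / mean_env for s, e in zip(noise, env)]


def toy_voice(n):
    """Hypothetical stand-in for a recorded voice: silence, then a two-tone burst."""
    onset = n // 4
    return [0.0 if t < onset else
            math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
            for t in range(n)]


demo_voice = toy_voice(128)
demo_noise = speech_like_noise(demo_voice, random.Random(0))
print(len(demo_noise), max(abs(v) for v in demo_noise))
```

The resulting noise is silent where the toy voice is silent and loud where it is loud, yet it carries none of the structure that makes a voice sound like a person. That is the property that lets such a signal add acoustic energy to a scene without adding a talker.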

The results were clear: What mattered was the number of talkers, not the presence or number of speech-like noises. In other words, people’s poor performance with an increasing number of talkers did not reflect a decreasing signal-to-noise ratio but an inability to differentiate and count talkers.

Kawashima and Sato offer a number of possible explanations for their observed effects. One promising explanation involves the notion of “perceptual indexing”, which refers to a process that tags certain salient features of a “scene” for later processing. Each feature thus becomes individuated and can be tracked individually. In vision, it is known that perceptual indexing becomes difficult with more than 3-4 objects. Kawashima and Sato may have uncovered a parallel limit for auditory scene analysis.

So what does this research tell us about your dilemma in the hotel? It suggests you should call security rather than trying to pacify the party by yourself—you are likely to think that there are only 4 people partying when in fact it’s the entire All Blacks doing the Haka.
