Overcoming babble with a bubble: “Seeing” speech can make language faster to process

Recently, an error was found in this paper. The updated paper is here.

—–

Do you like talking on the phone to strangers? No? Well, neither do I. And for good reason – talking to someone you do not know over noisy speakers that lose part of the sound spectrum can be challenging, especially if you struggle with hearing loss or are a non-native speaker of the language, in which case you might also have to contend with being misunderstood.

The former issue is one I have a lot of familiarity with. Anyone who has spent significant time with me knows that I am mostly deaf in my right ear. Relative to people with normal hearing, I rely more on external cues, such as lip movements, body language, and knowledge about the topic currently being discussed, to understand speech. But even for people who are not hard of hearing, this kind of multisensory integration is central to speech comprehension and can make understanding speech easier.

When context cues and visual information are taken away, such as when a friend is quoting from Purple Haze over the phone, we’re more likely to confuse “the sky” for “this guy” as a completion to “Please excuse me while I kiss…” These so-called “slips of the ear” in song lyrics highlight how hard listeners have to work to understand speech. Listeners have to recognize who is talking, no matter whether that person is laughing or telling a story; pay attention to a single person even if a hundred other people are talking; and integrate multiple sources of information, like audio and video, at the same time, all of which can tax working memory and attention. Even excluding confusing situations like peculiar song lyrics, we often ask speakers to repeat themselves. So, making speech easier to understand would benefit all of us.

Typically, the more information people have access to, the better they can understand language. Even people with normal hearing benefit from visual cues and knowledge about what a speaker might say. This is at least in part because face-to-face conversation, whether in person or over video, is the norm. Some websites can even visualize your speech in real time as you say it; you can check this out with Chrome Music Lab’s spectrogram visualizer. The output will look something like this:
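
For the curious, here is a minimal sketch (in Python, using scipy and matplotlib, and not the code behind Chrome Music Lab’s visualizer) of how a spectrogram like that can be computed; the file name “speech.wav” is just a placeholder for a mono recording.

    # Compute and plot a spectrogram: time on the x-axis, frequency on the
    # y-axis, and brighter cells where the recording is louder at that moment.
    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    rate, samples = wavfile.read("speech.wav")   # assumes a mono WAV file
    freqs, times, power = spectrogram(samples, fs=rate, nperseg=512)

    plt.pcolormesh(times, freqs, power, shading="auto")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()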

While this is a nifty user-interface trick, and visual cues are helpful for speech recognition, little research has probed whether these types of features or signals might help listeners understand speech better.

Thankfully, researchers Julia Strand, Violet Brown, and Dennis Barbour recently published an article in Psychonomic Bulletin & Review on just that: how a circle that changes in size might help speech recognition and/or decrease listener effort.

Improving speech recognition could be as simple as capturing the listener’s attention: a changing visual display might simply be more engaging, thereby improving comprehension indirectly. Experiment 1 therefore created four conditions that tested whether, and how closely, the visuals needed to line up with what listeners heard. In the “audio-only” (control) condition, a circle was present on the screen the whole time and did not change shape. In the “static” condition, the circle appeared when speech started and disappeared when speech stopped, but otherwise did not change in appearance. In the “signal” condition, the size of the circle tracked how loud the speech was from moment to moment. Finally, in the “yoked” condition, the circle’s movements were driven by a different sentence, to see whether a moving figure, even one out of sync with the sound, simply made the speech more engaging. This video presents an example of each of the four conditions:
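
To make the “signal” condition concrete, here is a rough sketch (my own illustration in Python, not the authors’ stimulus code, which is available on their OSF page) of how a recording’s moment-to-moment loudness can be turned into a circle radius; the window length and radius range are arbitrary choices.

    # Sketch of the "signal" condition: the circle's radius tracks the
    # short-term loudness (RMS amplitude) of the sentence recording.
    import numpy as np
    from scipy.io import wavfile

    rate, samples = wavfile.read("sentence.wav")   # placeholder mono recording
    samples = samples.astype(float)

    frame = int(0.025 * rate)                      # 25 ms analysis windows
    rms = np.array([
        np.sqrt(np.mean(samples[i:i + frame] ** 2))
        for i in range(0, len(samples) - frame, frame)
    ])

    # Louder windows draw a bigger circle; the pixel values are arbitrary.
    min_radius, max_radius = 20, 120
    radius = min_radius + (max_radius - min_radius) * rms / rms.max()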

During the experiment, participants were asked to type the sentences that they listened to into a response box. The sentences were embedded in a particular kind of noise known as two-talker babble, which is sort of similar to what happens at cocktail parties – babble presents two people’s voices at the same time, which makes speech recognition more difficult.
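
As a rough illustration of how such stimuli can be built (a generic recipe, not necessarily the authors’ exact procedure), a target sentence can be mixed with two overlapping voices at a chosen signal-to-noise ratio; all file names below are placeholders, and the recordings are assumed to be mono, sampled at the same rate, and at least as long as the target.

    # Mix a target sentence with two-talker babble at a chosen SNR (in dB).
    import numpy as np
    from scipy.io import wavfile

    def rms(x):
        return np.sqrt(np.mean(x ** 2))

    rate, target = wavfile.read("target_sentence.wav")
    _, talker_a = wavfile.read("babble_talker_a.wav")
    _, talker_b = wavfile.read("babble_talker_b.wav")

    target = target.astype(float)
    n = len(target)
    babble = talker_a.astype(float)[:n] + talker_b.astype(float)[:n]

    # Scale the babble so the target sits snr_db decibels above it, then mix.
    snr_db = -2                                    # arbitrary example level
    gain = rms(target) / (rms(babble) * 10 ** (snr_db / 20))
    mixture = target + gain * babble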

The authors found a surprising result: visual cues corresponding to speech did not improve people’s understanding.

Experiment 2 tested whether an animation in which a circle changed in size with the speech signal could reduce listening effort, even if it did not improve recognition. Strand and colleagues compared only the “audio-only” and “signal” conditions from the previous experiment, this time using a different task in which participants listened for nouns (like “cat”, “dog”, or “pizza”) in the same babble and repeated them aloud. The authors analyzed both how accurate listeners were and how quickly they could do the task, with faster responses taken as a sign of reduced listening effort. Again, the researchers found that a dynamic circle corresponding to the recording’s volume did not improve listeners’ accuracy, but every single participant was much faster (by 185 milliseconds on average) when the circle changed in size along with the recordings.

Even though not all visual information is created equal – seeing another person’s face improves speech recognition, but an abstract visual presentation of speech does not – getting visual information that aligns with what we hear can make listening in noise less difficult.

The research by Strand and colleagues shows that attention can be allocated more efficiently when listeners can process the same input using multiple sources of information, even though listeners do not become more accurate. This has important consequences for people like me who use hearing aids and rely on non-acoustic sources of information more than people with normal hearing.

It is also worth mentioning that this paper provides a great example of open, reproducible science. Not only were analyses pre-registered, but Strand and colleagues recruited a very large number of participants for an experiment in cognitive psychology (160 and 96 in Experiments 1 and 2, respectively), and put all of their stimuli, analysis code, data, and software online on the Open Science Framework website. As the Psychonomic Society moves more and more toward best practices for reproducibility, you can expect to see more articles like this one.

Psychonomics article highlighted in this post:

Strand, J. F., Brown, V. A., & Barbour, D. L. (2018). Talking points: A modulating circle reduces listening effort without improving speech recognition. Psychonomic Bulletin & Review. DOI: 10.3758/s13423-018-1489-7.

Author

  • Cassandra Jacobs is a graduate student in Psychology at the University of Illinois. Before this, she was a student of linguistics, psychology, and French at the University of Texas, where she worked under Zenzi Griffin and Colin Bannard. Currently she is applying machine learning methods from computer science to understand human language processing under the direction of Gary Dell.

