Deepfake, earwitnesses, and discrimination: Your voice matters

Technology has changed dramatically since I started as a Digital Associate Editor for the Psychonomic Society digital team almost 10 years ago. According to Google AI Generator,

“Since 2016, technology has advanced significantly, primarily driven by major breakthroughs and widespread adoption of Artificial Intelligence (AI) and Machine Learning. Other key areas of advancement include 5G connectivity, biotechnology, autonomous systems, and immersive technologies like Augmented Reality (AR) and Virtual Reality (VR).”

These advancements have propelled our understanding of the mind and body while enhancing almost every aspect of life. As Google AI detailed:

AI has moved from a niche field to an integral part of daily life and a primary driver of innovation. Generative AI has seen explosive growth, enabling the creation of new content (text, images, audio) using advanced models, revolutionizing industries from art to healthcare. AI as a Collaborator has evolved from simple assistants to sophisticated collaborators, aiding in complex tasks like diagnosing diseases and personalizing education. Natural Language Processing (NLP) has advanced significantly, powering highly effective chatbots, translation tools, and voice assistants that can understand and respond to human language more naturally.

Connectivity and Infrastructure [have increased exponentially, allowing for our global pandemic in 2020 to be mitigated as we harnessed the power of technology to help us return to “normalcy” with some new perks after the global devastation]. 5G Technology [rolled] out in 2016, providing substantially faster speeds, lower latency, and greater capacity, which is the foundation for other real-time applications like autonomous vehicles and remote surgery. Internet of Things (IoT) and Edge Computing interconnected billions of sensors and devices, [which] have digitized the physical world, enabling smart cities, homes, and industrial applications. Edge computing processes data closer to the source, reducing latency for real-time applications. Cloud services have become the default for businesses, with a recent focus on serverless computing, offering greater scalability and cost-efficiency [not to mention the convenience of storing thousands of photos and videos efficiently].

Biotechnology and Medicine [have benefited from these technological advancements, including areas such as] Genomics and Personalized Medicine, in which advances in DNA sequencing and AI-driven data analysis have paved the way for treatments tailored to an individual’s genetic makeup, with breakthroughs in cancer therapies and rare disease treatments becoming more mainstream. [The opportunities for] Gene Editing (e.g., CRISPR) have gained momentum, offering new avenues for treating diseases and enhancing health.

Immersive Experiences and Computing [once the realm of video gaming and military training, today the general public has access to] Augmented Reality (AR) and Virtual Reality (VR) due to major advancements that have moved beyond gaming into areas like remote collaboration, education, and retail (virtual try-ons). The concept of the Metaverse, a persistent shared virtual space, has also gained prominence.

While these advancements have led to exciting breakthroughs and gains in efficiency, there is also a dark side, as with any good story. In the case of the article summarized in this post, it is the looming threat of AI-generated audio and video – or “deepfake” content.

As defined by Google AI Generator,

“Deepfake technology uses AI to create realistic, manipulated videos, audio, or photos that can make it seem like someone said or did something they never did. This is achieved through deep learning algorithms, such as Generative Adversarial Networks (GANs), which involve training two neural networks against each other to generate and detect fakes. While deepfakes have potential uses in entertainment and communication, they are also used for malicious purposes like creating deepfake pornography, disinformation, and fraud.”

I have often found myself scrolling through the content that Meta’s Facebook algorithm has selected for me (usually animals doing fun or cool things) and marveling at how cool a video was, only to have my daughter say, “Mom, that’s fake!” when I shared it with her. Thankfully, I am not the only poor soul to be duped by AI-generated content. According to an article published on the UNESCO website, citing 2024 data from Statista, 46% of fraud experts had encountered synthetic identity fraud, 37% voice deepfakes, and 29% video deepfakes. This newest wave of fraud has even been incorporated into our annual technology training requirements!

As the authors Vyshnevetska, Giroud, Ramon, and Dellwo (pictured below) summarized,

“Recently, ‘fake police officer’ crimes were reported in many countries, whereby older adults are typically contacted by younger persons pretending to be police officials, attempting to deceive older adults out of their valuables. In these scenarios, older individuals can be persuaded to give over their possessions or transfer money to unfamiliar younger adults who are part of an arranged scam. In addition, deepfake voices, i.e., voices cloned using artificial intelligence techniques, are gaining popularity, enabling the creation of new speech utterances using someone’s cloned voice. These audio deepfakes have led to crimes in which victims are tricked into believing that someone among their dear ones is in urgent need of financial help. In both types of crimes, the ages of perpetrators and victims may differ drastically. Thus, an earwitness in court might have to testify whether the voice of a suspect from a significantly different age group belongs to a speaker they spoke to on the telephone during the voice crime. However, it is unclear whether listeners have a perceptual advantage for processing voices of their own age compared to other ages.”

Images of authors of the article highlighted in this post. From left to right – Valeriia Vyshnevetska, Nathalie Giroud, Meike Ramon, and Volker Dellwo.

This real-world problem led these authors to conduct a controlled experiment in which they investigated voice discrimination abilities of adults of different ages in a study published by the Psychonomic Society’s Cognitive Research: Principles & Implications journal.

In the experiment, younger (19–35 years) and older (65–83 years) listeners took part in a voice discrimination task that included both younger and older voices. In a voice discrimination task, participants are presented with pairs of voice samples and must judge whether both were spoken by the same speaker or by different speakers.

The figure below shows the fundamental frequency ranges of the four voice conditions tested (identified by color): female older adult (OA, purple), female younger adult (YA, blue), male older adult (OA, turquoise-green), and male younger adult (YA, yellow). The left panel shows each group’s distribution and degree of overlap: older and younger voices overlap substantially in range, whereas male and female voices form distinct distributions. The dashed vertical lines in the left panel mark each group’s mean (in the corresponding color).

Figure from Vyshnevetska et al. (2025) detailing the speaker fundamental frequency ranges used for the four different voice conditions – female older adult (OA, purple), female younger adult (YA, blue), male older adult (OA, turquoise-green), and male younger adult (YA, yellow).

The authors used a signal detection theory approach to evaluate the data. Signal detection theory models decisions about whether a signal is present or absent amid background noise, based on the strength of the evidence and the observer’s decision criterion. There are four possible outcomes: correctly identifying the presence of a signal (a “hit”), correctly identifying its absence (a “correct rejection”), reporting a signal that wasn’t there (a “false alarm”), or failing to detect an actual signal (a “miss”).
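To make these four outcomes concrete, here is a minimal sketch of how sensitivity (d’) and response bias (c) are conventionally computed from outcome counts in a same/different task, treating a “same speaker” response to a genuine same-speaker pair as a hit. This is an illustrative sketch using standard formulas, not the authors’ actual analysis code; the function name and the log-linear correction are my own assumptions.

```python
from statistics import NormalDist

def dprime_and_c(hits, misses, false_alarms, correct_rejections):
    """Compute sensitivity (d') and response bias (c) from trial counts.

    Standard signal detection formulas:
        d' = z(hit rate) - z(false alarm rate)
        c  = -0.5 * (z(hit rate) + z(false alarm rate))
    """
    # Log-linear correction keeps rates away from 0 and 1,
    # which would otherwise produce infinite z-scores.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)

    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    c = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, c

# Hypothetical listener: 45 hits, 5 misses, 5 false alarms, 45 correct rejections
d_prime, c = dprime_and_c(45, 5, 5, 45)
```

With these hypothetical counts, the listener discriminates well (d’ near 2.5) and shows essentially no bias (c near 0); a positive d’ means hits outpace false alarms, while a nonzero c means the listener favors one response regardless of the stimulus.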

Vyshnevetska and co-authors assessed sensitivity (d’) and response bias (c) as their primary outcomes. Sensitivity (d’) reflects the ability to perceive a signal – here, to detect differences between voices. Consider the boxplots and raincloud plots in the figure below: younger adult listeners were more sensitive to voice differences than older adult listeners (Plot A), while male and female listeners were equally sensitive (Plot B). However, participants were more sensitive to male voices than to female voices (Plot D), though they were equally sensitive to the voices of older versus younger adults (Plot C).

Figure from Vyshnevetska et al. (2025) summarizing sensitivity findings.

Response bias (c) is a listener’s tendency to prefer one type of response over the other. As shown in the figure below, listeners showed no bias related to their own age (Plot A) or sex (Plot B). However, they did show a bias for young adult voices (Plot C) and for female voices (Plot D). That is, the authors “showed that when hearing younger speaker pairs and female speaker pairs, listeners are significantly biased to say that both excerpts stem from the same speaker.”

Figure from Vyshnevetska et al. (2025) summarizing preference findings.

Given the world today, this apparent perceptual bias may play a key role in forensic procedures. “For voice crime cases, this could imply that earwitnesses might find it more challenging to discriminate [between] younger speakers and female speakers, especially if the audio quality is poor, as is often the case for voice crimes.”

As Vyshnevetska and her co-authors emphasized, it is unclear whether these results reflect fundamental acoustic characteristics of younger and older, male and female voices, or some in-group/out-group bias that might be mitigated through explicit training. Clearly, more research is needed to disentangle these acoustic features and their implications for human decision-making in a variety of contexts.

Unfortunately, I imagine this task will only become harder as deepfake technology grows more human-like. However, this may be a moot point for the next generation, who have grown up learning to discriminate between real and AI-generated content. [The key is the extra hand or robotic voice, supposedly!]

Featured Psychonomic Society article

Vyshnevetska, V., Giroud, N., Ramon, M., & Dellwo, V. (2025). Listeners are biased towards voices of young speakers and female speakers when discriminating voices. Cognitive Research: Principles & Implications, 10(1), 28. https://doi.org/10.1186/s41235-025-00636-3

Comment: The use of AI generative text was incorporated for part of this post, but did not seem to significantly improve the efficiency of the author.

Author

  • Heather Hill is a Professor at St. Mary’s University. She has conducted research on the mother-calf relationship and social development of bottlenose dolphins in human care. She also studied mirror self-recognition and mirror use in dolphins and sea lions. Most recently, she has been studying the social behavior and cognitive abilities of belugas, killer whales, Pacific white-sided dolphins, and bottlenose dolphins in human care. She has also been known to dabble in various aspects of human cognition and development, often at the intersection of those two fields.


The Psychonomic Society (Society) is providing information in the Featured Content section of its website as a benefit and service in furtherance of the Society’s nonprofit and tax-exempt status. The Society does not exert editorial control over such materials, and any opinions expressed in the Featured Content articles are solely those of the individual authors and do not necessarily reflect the opinions or policies of the Society. The Society does not guarantee the accuracy of the content contained in the Featured Content portion of the website and specifically disclaims any and all liability for any claims or damages that result from reliance on such content by third parties.
