
How the brain handles the “cocktail party problem”

Using a computational model, neuroscientists showed how the brain can selectively focus attention on one voice among others in a noisy environment.

Press Contact:

Sarah McDonnell
Phone: 617-253-8923
Fax: 617-258-8762
MIT News Office

[Image: Josh McDermott and Ian Griffith are in a room filled with speakers, curving around them like a cave.]
Caption: Josh McDermott (left), professor of brain and cognitive sciences and associate investigator at the McGovern Institute, sits with graduate student Ian Griffith in the speaker array room where they conducted the study. Photo: Steph Stevens

MIT neuroscientists have figured out how the brain is able to focus on a single voice among a cacophony of others, shedding light on a longstanding question in neuroscience known as the cocktail party problem.

This attentional focus becomes necessary when you’re in any crowded environment, such as a cocktail party, with many conversations going on at once. Somehow, your brain is able to follow the voice of the person you’re talking to, despite all the other voices that you’re hearing in the background.

Using a computational model of the auditory system, the MIT team found that amplifying the activity of the neural processing units that respond to features of a target voice, such as its pitch, allows that voice to be boosted to the forefront of attention.

“That simple motif is enough to cause much of the phenotype of human auditory attention to emerge, and the model ends up reproducing a very wide range of human attentional behaviors for sound,” says Josh McDermott, a professor of brain and cognitive sciences at MIT, a member of MIT’s McGovern Institute for Brain Research and Center for Brains, Minds, and Machines, and the senior author of the study.

The findings are consistent with previous studies showing that when people or animals focus on a specific auditory input, neurons in the auditory cortex that respond to features of the target stimulus amplify their activity. This is the first study to show that this extra boost is enough to explain how the brain solves the cocktail party problem.

Ian Griffith, a graduate student in the Harvard Program in Speech and Hearing Biosciences and Technology, who is advised by McDermott, is the lead author of the paper. MIT graduate student R. Preston Hess is also an author of the paper, which appears today in Nature Human Behaviour.

Modeling attention

Neuroscientists have been studying the phenomenon of selective attention for decades. Many studies in people and animals have shown that when focusing on a particular stimulus like the sound of someone’s voice, neurons that are tuned to features of that voice — for example, high pitch — amplify their activity.

When this amplification occurs, neurons’ firing rates are scaled upward, as though multiplied by a number greater than one. It has been proposed that these “multiplicative gains” allow the brain to focus its attention on certain stimuli. Neurons that aren’t tuned to the target feature exhibit a corresponding reduction in activity.
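As a rough illustration of that arithmetic (a toy sketch in Python, not the study's code, with all firing rates and gain values invented):

    import numpy as np

    # Hypothetical firing rates (spikes/s) of four neurons responding to a
    # mixture: the first two are tuned to the target's low pitch, the last
    # two to a distractor's high pitch. All values are invented.
    baseline_rates = np.array([12.0, 9.0, 11.0, 8.0])

    # Multiplicative gains: >1 for units tuned to the attended feature,
    # <1 for units tuned to other features.
    gains = np.array([1.8, 1.8, 0.6, 0.6])

    attended_rates = gains * baseline_rates
    print(attended_rates)  # [21.6 16.2  6.6  4.8] -- target-tuned units now dominate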

“The responses of neurons tuned to features that are in the target of attention get scaled up,” Griffith says. “Those effects have been known for a very long time, but what’s been unclear is whether that effect is sufficient to explain what happens when you’re trying to pay attention to a voice or selectively attend to one object.”

This question has remained unanswered because computational models of perception haven’t been able to perform attentional tasks such as picking one voice out of many. Such models can readily perform auditory tasks when there is an unambiguous target sound to identify, but they haven’t been able to perform those tasks when other stimuli are competing for their attention.

“None of our models has had the ability that humans have, to be cued to a particular object or a particular sound and then to base their response on that object or that sound. That’s been a real limitation,” McDermott says.

In this study, the MIT team wanted to see if they could train models to perform those types of tasks by enabling the model to produce neuronal activity boosts like those seen in the human brain.

To do that, they began with a neural network that they and other researchers have used to model audition, and then modified the model to allow each of its stages to implement multiplicative gains. Under this architecture, the activation of processing units within the model can be boosted up or down depending on the specific features they represent, such as pitch.
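A minimal sketch of such an architecture, with random weights standing in for a trained auditory model and a per-unit gain vector attached to each stage, might look like this:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy two-stage network; random weights stand in for a trained auditory model.
    W1 = rng.standard_normal((16, 64))   # stage 1: 16 input features -> 64 units
    W2 = rng.standard_normal((64, 32))   # stage 2: 64 units -> 32 units

    def forward(x, gains1, gains2):
        # Each stage's activations are scaled, unit by unit, by a
        # multiplicative gain before being passed to the next stage.
        h1 = np.maximum(0.0, x @ W1) * gains1   # ReLU, then feature-wise gain
        h2 = np.maximum(0.0, h1 @ W2) * gains2
        return h2

    x = rng.standard_normal(16)                  # stand-in for an audio feature vector
    out = forward(x, np.ones(64), np.ones(32))   # gains of 1: unmodified model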

To train the model, on each trial the researchers first fed it a “cue”: an audio clip of the voice that they wanted the model to pay attention to. The unit activations produced by the cue then determined the multiplicative gains that were applied when the model heard a subsequent stimulus.

“Imagine the cue is an excerpt of a voice that has a low pitch. Then, the units in the model that represent low pitch would get multiplied by a large gain, whereas the units that represent high pitch would get attenuated,” Griffith says.

Then, the model was given clips featuring a mix of voices, including the target voice, and was asked to identify the second word spoken by the target voice. The model's activations in response to this mixture were multiplied by the gains derived from the cue. This was expected to amplify the target voice within the model, but it was not clear whether that boost would be enough to yield human-like attentional behavior.
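That two-step procedure can be sketched as follows. The rule mapping cue activations to gains here is an invented stand-in (in the study, attentional behavior emerges from training the full model), and all activation values are hypothetical:

    import numpy as np

    def gains_from_cue(cue_activations, strength=2.0):
        # Units that responded strongly to the cue get gains above 1,
        # weakly responding units get gains below 1. This normalization
        # rule is an invented stand-in for what the model learns.
        normalized = cue_activations / cue_activations.mean()
        return normalized ** strength

    cue_acts = np.array([3.0, 2.5, 0.4, 0.3])       # cue: low-pitch voice alone
    mixture_acts = np.array([2.0, 1.8, 2.1, 1.9])   # mixture: both voices present

    attended = gains_from_cue(cue_acts) * mixture_acts
    # Low-pitch units now dominate the response to the mixture, in effect
    # amplifying the target voice: roughly [7.5, 4.7, 0.14, 0.07]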

The researchers found that under a variety of conditions, the model performed very similarly to humans, and it tended to make errors similar to those that humans make. For example, like humans, it sometimes made mistakes when trying to focus on one of two male voices or one of two female voices, which are more likely to have similar pitches.

“We did experiments measuring how well people can select voices across a pretty wide range of conditions, and the model reproduces the pattern of behavior pretty well,” Griffith says.

Effects of location

Previous research has shown that in addition to pitch, spatial location is a key factor that helps people focus on a particular voice or sound. The MIT team found that the model also learned to use spatial location for attentional selection, performing better when the target voice was at a different location from distractor voices.

The researchers then used the model to discover new properties of human spatial attention. Using their computational model, the researchers were able to test all possible combinations of target locations and distractor locations, an undertaking that would be hugely time-consuming with human subjects.
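In code, such a screen is just a loop over location pairs. The model_accuracy function below is a toy stand-in whose shape merely mimics the pattern described next (horizontal separation helping far more than vertical); it is not the study's model:

    import itertools
    import math

    # Hypothetical grid of source positions (azimuth, elevation) in degrees.
    azimuths = range(-90, 91, 30)
    elevations = range(-30, 31, 30)
    locations = list(itertools.product(azimuths, elevations))

    def model_accuracy(target, distractor):
        # Toy stand-in for evaluating the trained model on a cued mixture
        # with the target and distractor rendered at the given positions.
        d_az = abs(target[0] - distractor[0])
        d_el = abs(target[1] - distractor[1])
        return 0.5 + 0.4 * math.tanh(d_az / 45) + 0.05 * math.tanh(d_el / 45)

    # Screen every target/distractor pairing -- fast for a model,
    # prohibitively slow for human subjects.
    results = {(t, d): model_accuracy(t, d)
               for t, d in itertools.product(locations, repeat=2) if t != d}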

“You can use the model as a way to screen large numbers of conditions to look for interesting patterns, and then once you find something interesting, you can go and do the experiment in humans,” McDermott says.

These experiments revealed that the model was much better at correctly selecting the target voice when the target and distractor were at different locations in the horizontal plane. When the sounds were instead separated in the vertical plane, this task became much more difficult. When the researchers ran a similar experiment with human subjects, they observed the same result.

“That was just one example where we were able to use the model as an engine for discovery, which I think is an exciting application for this kind of model,” McDermott says.

Another application the researchers are pursuing is using this kind of model to simulate listening through a cochlear implant. These studies, they hope, could lead to improvements in cochlear implants that help their users focus attention more successfully in noisy environments.

The research was funded by the National Institutes of Health.
