How Your Brain Tells Speech and Music Apart

Simple cues help people to distinguish song from the spoken word

People generally don’t confuse the sounds of singing and talking. That may seem obvious, but it’s actually quite impressive—particularly because we can usually differentiate between the two even when we encounter a language or musical genre we’ve never heard before. How exactly does the human brain so effortlessly and instantaneously make such judgments?

Scientists have a relatively rich understanding of how the sounds of speech are transformed into sentences and how musical sounds move us emotionally. When sound waves reach the inner ear, they trigger activity in the cochlea, which sends signals to the brain via the auditory nerve. These signals travel the so-called auditory pathway, first to a subregion that processes all kinds of sounds and then to dedicated music or language subregions. Depending on where the signal ends up, a person comprehends the sound as a particular type of meaningful information—and can distinguish an aria from a spoken sentence.

That’s the broad-strokes story of sound processing. But it remains surprisingly unclear how exactly our perceptual system differentiates these sounds within the auditory pathway. Certainly there are clues: music and speech waveforms have distinct pitches (tones sounding high or low), timbres (qualities of sound), phonemes (speech-sound units) and melodies. But the brain’s auditory pathway does not process all those elements at once. Consider the analogy of sending a letter in the mail from, say, New York City to Taipei. The letter’s contents provide a detailed explanation of its purpose, but the envelope still must indicate its destination. Similarly, even though speech and music are packed with information, our brain needs some basic cues to rapidly determine which regions to engage.


The question for neuroscientists is how the brain decides whether to send incoming sound to the language or music regions for detailed processing. My colleagues at New York University, the Chinese University of Hong Kong and the National Autonomous University of Mexico and I decided to investigate this mystery. In a study published last spring, we present evidence that a simple property of sound called amplitude modulation—which describes how rapidly the volume, or “amplitude,” of a series of sounds changes over time—is a key clue in the brain’s rapid acoustic judgments. Our findings hint at the distinct evolutionary roles music and speech have had for the human species.

Past research had shown that the amplitude-modulation rate of speech is highly consistent across languages, measuring four to five hertz, meaning the volume rises and falls four to five times per second. Meanwhile the amplitude-modulation rate of music is consistent across genres, at about one to two hertz. Put another way: when we talk, the volume of our voice changes much more rapidly than it does when we sing.
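
To make the notion of an amplitude-modulation rate concrete, here is a minimal Python sketch of how such a rate could be estimated from a recording. This is not the analysis code from the study; the file name and the frequency band searched are illustrative assumptions. The idea is to extract a clip’s volume envelope and find the dominant frequency at which that envelope rises and falls.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import hilbert

    # Hypothetical mono recording; any short speech or music clip would do.
    sr, audio = wavfile.read("speech_clip.wav")
    audio = audio.astype(float)

    # The envelope traces how the volume rises and falls over time.
    envelope = np.abs(hilbert(audio))
    envelope -= envelope.mean()

    # The dominant frequency of the envelope's fluctuations is the
    # amplitude-modulation rate: roughly 4-5 Hz for speech, 1-2 Hz for music.
    spectrum = np.abs(np.fft.rfft(envelope))
    freqs = np.fft.rfftfreq(len(envelope), d=1 / sr)
    band = (freqs > 0.5) & (freqs < 32)
    mod_rate = freqs[band][np.argmax(spectrum[band])]
    print(f"dominant amplitude-modulation rate: {mod_rate:.1f} Hz")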

Given the cross-cultural consistency of this pattern, we wondered whether amplitude modulation might reflect a universal biological signature that plays a critical role in how the brain distinguishes speech and music. We created special audio clips of white noise in which we adjusted how rapidly the volume changed over time. We also adjusted how regularly such changes occurred—that is, whether the audio had a reliable rhythm. We used these white noise clips rather than realistic audio recordings to isolate the effects of amplitude modulation from other aspects of sound, such as pitch or timbre.
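
To give a rough sense of what such stimuli might look like, the sketch below generates amplitude-modulated white noise in which the modulation rate and the regularity of the rhythm can be varied independently. It is not the study’s actual stimulus-generation code; the function name, parameters and values are hypothetical.

    import numpy as np

    def am_noise(duration=4.0, sr=44100, mod_rate=2.0, irregularity=0.0, seed=0):
        """White noise whose volume rises and falls at roughly mod_rate hertz.

        irregularity = 0 gives a perfectly steady rhythm; larger values jitter
        the timing of each volume cycle, making the rhythm less predictable.
        """
        rng = np.random.default_rng(seed)
        n = int(duration * sr)
        noise = rng.standard_normal(n)  # white-noise carrier: no pitch or melody

        # Build the volume envelope one cycle at a time, jittering each cycle's length.
        envelope = []
        while len(envelope) < n:
            period = sr / (mod_rate * (1 + irregularity * rng.uniform(-0.5, 0.5)))
            cycle = 0.5 * (1 - np.cos(2 * np.pi * np.arange(int(period)) / period))
            envelope.extend(cycle)

        return noise * np.array(envelope[:n])

    music_like = am_noise(mod_rate=1.5, irregularity=0.0)   # slow, regular
    speech_like = am_noise(mod_rate=4.5, irregularity=0.6)  # fast, irregular

Because the carrier is noise, clips like these have no pitch, melody or words; the only systematic differences between them are how fast and how regularly the volume changes.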

Across four experiments with more than 300 participants, we asked people to listen to these audio files and tell us whether they sounded more like speech or music. The results revealed a strikingly simple principle: audio clips with slower amplitude-modulation rates and more regular rhythms were more likely to be judged as music, and the opposite pattern held for speech. This suggests that our brain associates slower, more regular changes in amplitude with music and faster, more irregular changes with speech.

These findings inspire deeper questions about the human mind. First, why are speech and music so distinct in how their amplitude changes over time? Evolutionary hypotheses offer some possible answers. Humans use speech for communication. When we talk, we engage muscles in the vocal tract, including the jaw, tongue and lips. A comfortable speed for moving these muscles while talking is around four to five hertz. Interestingly, our auditory perception of sound at this speed is enhanced. This alignment between the speeds of production and perception is probably not a coincidence. A possible explanation is that humans talk at this neurophysiologically optimized fast speed to ensure efficient information exchange—and this fast talking could explain the higher amplitude-modulation rate in speech versus music.

Separately, one hypothesis about the evolutionary origin of music is that it effectively builds social bonds within a society by coordinating multiple people’s activities and movements, such as through parent-infant interactions, group dancing and work songs. Studies have shown that people bond more closely when they move together in synchrony. Therefore, it’s possible that for music to serve its evolutionary function, it needs to be at a speed that allows for comfortable human movement, around one to two hertz or lower. Additionally, a predictable beat makes the music more appealing for dancing in a group.

There are still many questions to explore. More studies are needed to determine whether the brain can separate music and speech from birth based on amplitude modulation or whether this ability relies on learned patterns. Understanding this mechanism could help patients with aphasia, a condition that affects verbal communication, comprehend language through music with carefully tuned speed and regularity. Our evolutionary concepts, too, warrant further investigation. Diverse hypotheses exist about the evolutionary origins of music and speech, which could spur other studies. And more cross-cultural research could ensure that these ideas really hold up across all communities.

As to how the brain distinguishes music from speech in the auditory pathway, we suspect there is more to uncover. Amplitude modulation is most likely just one factor—one line, perhaps, on the addressed envelope—that can help explain our brain’s amazing capacity for discernment.

This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.

Andrew Chang is a postdoctoral fellow at New York University, whose work has been supported by the National Institutes of Health’s Ruth L. Kirschstein Postdoctoral Individual National Research Service Award and by the Leon Levy Scholarships in Neuroscience. He studies the neural mechanisms of auditory perception and the ways people use music and speech to interact in the real world.
