Comparing speech-to-text to human listening
January 10, 2023
One of the reasons for the current slowdown in the number of consumers buying and using smart assistants (e.g., Alexa) in their homes is privacy: the concern that these devices are always “listening” to what people are saying, whether or not they are called upon to do so.
While consumers’ privacy concerns are valid (and as a privacy-by-design company, SoapBox takes its commitment to privacy very seriously), is the smart speaker really “listening”? Is that metaphor accurate, or even appropriate?
Computers and humans process audio and sound in fundamentally different ways. While computers rely on algorithms and pre-programmed instructions to analyze and understand audio signals, humans use their brains and senses to interpret and make sense of sounds.
In this blog, we’ll explore the science behind speech-to-text (STT) and the differences between how computers and humans process audio.
Let’s start by exploring the evolutionary wonder of human listening.
How humans listen
When we hear a sound, our outer ears capture sound waves and funnel them to the eardrum, which vibrates. The inner ear converts these vibrations into electrical signals, which the auditory nerve carries to the brain, where they are interpreted and understood.
This process takes milliseconds and lets us process multiple sounds in parallel, continuously (even when we are asleep), a miracle of biological and neurological engineering.
Our brains are incredibly efficient at processing and understanding speech, even in noisy or challenging environments. We can easily distinguish between different speakers, accents, and languages, and we can understand the meaning of words and phrases even when they are spoken at different speeds or in different tones.
How humans listen also highlights the ephemeral quality of how we interpret sound. Unless we use an external recording device, there is no permanent record of the sound we hear, other than our own ability to recall it from memory. And that’s why, when it comes to speech, we use all sorts of fun phrases to underline the fleeting nature of what we hear: “between you and me,” “off the record,” “one person’s word against the other,” etc.
Now, let’s compare how humans listen to how computers process sound.
Understanding speech-to-text technology
At its core, speech-to-text technology — also known as automatic speech recognition (ASR) or, more broadly, voice technology or voice AI — is a type of artificial intelligence (AI) that enables computers to recognize and transcribe spoken words into written text or a digital format.
How does a computer process audio?
The process begins with the computer receiving an audio signal from a microphone or other sound input device (“the outer ear”). These sound waves then need to be translated into digital data that a computer (“the brain”) can interpret.
This happens in a number of steps: first the sound waves are digitized; then the signal is normalized to account for background noise, volume, and accent; and finally it is broken into tiny segments that can be compared against the phonemes we use in language to make up words.
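To give a rough sense of what those first steps look like in practice, here is a minimal sketch in Python. It assumes a 16 kHz mono WAV file at a hypothetical path (“speech.wav”) and uses only NumPy and SciPy; real ASR front ends are far more sophisticated.

```python
import numpy as np
from scipy.io import wavfile

# 1. Digitize: read the waveform as an array of samples.
sample_rate, samples = wavfile.read("speech.wav")
samples = samples.astype(np.float32)

# 2. Normalize: scale the signal so differences in recording
#    volume matter less downstream.
samples /= (np.max(np.abs(samples)) + 1e-9)

# 3. Segment: slice the signal into short overlapping frames (~25 ms),
#    roughly the time scale at which phonemes are analyzed.
frame_len = int(0.025 * sample_rate)   # 25 ms frames
hop_len = int(0.010 * sample_rate)     # 10 ms hop
frames = [
    samples[start:start + frame_len]
    for start in range(0, len(samples) - frame_len, hop_len)
]
print(f"{len(frames)} frames of {frame_len} samples each")
```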
To determine what was said, the computer relies on powerful and complex statistical models, mathematical functions, and machine learning algorithms.
These algorithms are trained on large datasets of audio and corresponding transcriptions, and they learn to recognize patterns in the audio that correspond to certain words or phrases. When presented with a new audio signal, the algorithm can use this learned knowledge to transcribe the speech into text.
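To make that concrete, here is a minimal, purely illustrative sketch using the open-source SpeechRecognition Python package (an assumption for the example; it is not the engine SoapBox uses). The file path is hypothetical.

```python
# Illustrative only: transcribe a short WAV file with a pretrained recognizer.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("speech.wav") as source:
    audio = recognizer.record(source)  # read the entire file into memory

# The recognizer passes the audio to a pretrained model and returns the
# most likely transcript as plain text.
text = recognizer.recognize_google(audio)
print(text)
```

Note that `recognize_google` sends the audio to a remote service for transcription, a reminder that spoken audio often leaves the device once speech-to-text is involved.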
Speech-to-text transcripts are permanent
This is one of the main differences between STT and how we process speech as humans.
The fundamental output of speech-to-text is a transcript, a permanent record of what was (most likely) said. So rather than being ephemeral, the spoken word is translated into a permanent digital record as a transcription output.
In addition to a transcript, other permanent secondary outputs can also be produced, such as speaker identification, accent, emotion, and more.
Why is understanding speech-to-text important?
STT enables us to generate voice data from the original recording, which, when further analyzed using natural language understanding (NLU), for example, can be used to infer meaning and intent, and to gather detailed personal information.
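As a toy illustration of that last step (and emphatically not SoapBox’s NLU), even a simple keyword rule over a transcript can surface intent; real systems use trained language models rather than keyword lists.

```python
# A toy, rule-based "intent" check over a transcript string, just to show
# how a permanent text record becomes analyzable data.
transcript = "can you play some music in the kitchen"

intents = {
    "play_music": ["play", "music", "song"],
    "set_timer": ["timer", "remind", "alarm"],
}

# Count how many keywords from each intent appear in the transcript.
scores = {
    name: sum(word in transcript.split() for word in keywords)
    for name, keywords in intents.items()
}
best = max(scores, key=scores.get)
print(best, scores)  # -> play_music {'play_music': 2, 'set_timer': 0}
```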
A huge amount of value and benefit can be generated for users from gathering and analyzing their voice data. Take for example our product SoapBox Fluency, which analyzes the voice recordings of children reading aloud to automatically and accurately assess, down to the phoneme level, how well they’re progressing on their literacy journey.
Privacy first
So, while computers using STT may not be “listening” to humans in exactly the way we would listen to each other, they do have the potential to be incredibly powerful personal data accumulation tools.
The collection, processing, and analysis of consumer data must adhere to a privacy-first approach that respects each individual’s fundamental rights (i.e., for their data to remain anonymous and for it not to be sold or reused for marketing or profiling purposes). Privacy first becomes an even greater consideration when the voice data needing protection belongs to children.