Decoding Emotion in Voice Responses
As a listening company, inVibe puts voice data at the heart of everything we do. Hearing how people speak (in addition to what they say) reveals emotional signals that transcripts alone can’t capture. While a single emotional snippet may compel an audience in the moment, it isn’t enough to drive confident decision-making. That’s where inVibe’s speech emotion recognition (SER) model comes in. SER allows us to measure the emotionality in respondents’ voices in a rigorous, consistent, and visualizable way, turning fleeting impressions into reliable insight.
Quantifying Emotion in Acoustic Signals
Our SER model measures roughly 40 acoustic features in every response, ranging from the intuitive (pitch, loudness, and speed) to the subtle (voice onset, signal energy, vowel formants). These features are then aggregated into three validated emotional dimensions:
- Valence: Positivity or negativity. Think: happy versus upset.
- Activation: Engagement or interest. Think: excited versus bored.
- Dominance: Confidence or control. Think: assertive versus uncertain.
These three dimensions have been shown to outperform direct emotion classification (assigning discrete labels like “angry” or “happy”) because they capture the universal building blocks of emotional expression.
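To make this concrete, here is a minimal sketch of what measuring a few such acoustic features might look like in Python, using the open-source librosa library. The specific features, parameters, and function names are our illustrative assumptions, not inVibe’s production pipeline.

```python
import numpy as np
import librosa

def extract_acoustic_features(audio_path: str) -> dict:
    """Summarize a few acoustic features of one recorded response.

    Illustrative only: a production SER model measures many more
    features than the handful sketched here.
    """
    y, sr = librosa.load(audio_path, sr=None)

    # Pitch contour via probabilistic YIN (f0 is NaN in unvoiced frames)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
    )
    f0_voiced = f0[voiced_flag]

    # Loudness proxy: per-frame root-mean-square energy
    rms = librosa.feature.rms(y=y)[0]

    # Speed proxy: syllable-like onsets per second
    onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    duration_s = len(y) / sr

    return {
        "pitch_mean_hz": float(np.nanmean(f0_voiced)),
        "pitch_std_hz": float(np.nanstd(f0_voiced)),
        "loudness_mean": float(rms.mean()),
        "loudness_std": float(rms.std()),
        "speaking_rate_per_s": len(onset_times) / duration_s,
    }
```

In a pipeline like the one described above, summaries of this kind would be computed for the full feature set and passed to a trained model that scores each response on valence, activation, and dominance.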
A Model Built for Real-World Voices
To ensure our SER model performs reliably across speakers and cultural backgrounds, we trained it on a large, cross-cultural dataset so that it learns the universal, speaker-independent signatures of each emotional dimension. And because every voice is different, we normalize each respondent’s emotional measurements against a personal baseline collected at the beginning of the survey. This keeps our acoustic measurements from being biased by individual differences in pitch, loudness, or speaking style.
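One common way to implement this kind of per-speaker normalization is a z-score against the baseline recordings. The sketch below assumes feature dictionaries like the one produced above; it illustrates the idea rather than inVibe’s exact method.

```python
import numpy as np

def normalize_to_baseline(response_features: dict,
                          baseline_features: list[dict]) -> dict:
    """Express each feature as a z-score relative to the speaker's own
    baseline, so scores reflect deviation from that person's normal voice
    rather than differences between voices."""
    normalized = {}
    for name, value in response_features.items():
        baseline_values = np.array([b[name] for b in baseline_features])
        mu, sigma = baseline_values.mean(), baseline_values.std()
        # Guard against a flat baseline (zero variance)
        normalized[name] = (value - mu) / sigma if sigma > 0 else 0.0
    return normalized
```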
Visualizing the Emotionality in Voice Data
Once the acoustic signals are quantified, we visualize them to make the emotional patterns immediately clear. Plotting two acoustic dimensions on the same chart allows us to see how responses cluster and what those clusters reveal. For example, the visualization below shows the emotionality in the voices of people with psoriasis as they discuss their experience living with this condition.

Valence (positivity or negativity) is plotted on the x-axis, and activation (engagement or interest) on the y-axis. Each dot represents the acoustic character of one patient’s response.
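A chart like this is straightforward to produce once each response has been scored. The matplotlib sketch below uses made-up, hypothetical scores on a normalized scale purely to show how such a plot is constructed.

```python
import matplotlib.pyplot as plt

# Hypothetical per-response scores (e.g., z-scores from the
# normalization step); not real patient data
valence = [-1.2, -0.8, -0.5, 0.3, -1.0, 0.1]
activation = [-0.4, -0.9, -0.2, 0.6, -0.7, 0.2]

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(valence, activation, alpha=0.7)
ax.axhline(0, color="gray", linewidth=0.5)
ax.axvline(0, color="gray", linewidth=0.5)
ax.set_xlabel("Valence (negative to positive)")
ax.set_ylabel("Activation (bored to engaged)")
ax.set_title("Emotionality of patient responses")
plt.show()
```

Responses clustering in the lower-left quadrant (negative valence, low activation) are what reads, acoustically, as resignation.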
Visualizing emotionality in this way allows us to show our clients the resignation that we hear in the responses:
- “Living with psoriasis is very frustrating because of the flare-ups, and people look at you funny when you have splotches all over your skin.”
- “Okay, so, living with psoriasis can be physically uncomfortable and very emotionally draining.”
- “It's a condition that makes me feel really self-conscious and has caused me to lose a lot of confidence in myself.”
Integrating these vocal cues with the language people use gives us a more detailed understanding of their lived experience, bringing emotional nuance into focus in a way that is both systematic and human.
Getting the Full Story with inVibe
Emotion heard in the voice is powerful, but emotion measured is transformative. By pairing our acoustic analyses with our linguistic expertise, we help our clients move beyond intuition toward clear, defensible insights. If you’re interested in listening more closely to your stakeholders, reach out and schedule a demo to see for yourself!