What acoustic correlates do researchers primarily rely on for classifying vocalizations in SER?
Pitch, energy, and speaking rate
The classification of vocalizations into emotional categories in Speech Emotion Recognition (SER) typically begins with readily measurable acoustic correlates of speech production. Pitch, the perceived highness or lowness of the voice, often shifts markedly with arousal level: it rises sharply with excitement or anger, for example. Energy, which reflects the intensity or loudness of the voice, is another strong indicator; anger typically manifests with higher energy than states such as sadness or boredom. Speaking rate, the speed at which words are articulated, provides temporal cues: rapid speech can signal anxiety or excitement, while slow speech may indicate sadness or deep thought. Although advanced systems use more complex features such as MFCCs (Mel-Frequency Cepstral Coefficients), these three fundamental parameters (pitch, energy, and rate) form the initial, measurable foundation on which researchers build quantifiable models for emotional state detection.
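To make the first two correlates concrete, here is a minimal sketch in plain NumPy showing how pitch and energy might be estimated from a single analysis frame. The autocorrelation pitch method and the synthetic 220 Hz tone are illustrative choices, not part of any specific SER system; real pipelines typically use dedicated libraries (and speaking rate would additionally require syllable or onset detection, omitted here for brevity).

```python
import numpy as np

def estimate_pitch(frame, sr, fmin=50, fmax=500):
    """Estimate fundamental frequency (Hz) of a voiced frame via
    autocorrelation, searching lags between sr/fmax and sr/fmin."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(corr[lo:hi])  # lag of the strongest periodicity
    return sr / lag

def rms_energy(frame):
    """Root-mean-square energy, a simple loudness/intensity proxy."""
    return np.sqrt(np.mean(frame ** 2))

# Synthetic "voiced" signal: a 220 Hz tone at amplitude 0.5, 16 kHz sample rate
sr = 16000
t = np.arange(sr) / sr
signal = 0.5 * np.sin(2 * np.pi * 220 * t)

pitch = estimate_pitch(signal[:1024], sr)   # close to 220 Hz
energy = rms_energy(signal)                 # close to 0.5 / sqrt(2) ~ 0.354
```

In a full SER front end these frame-level values would be tracked over time, since it is their contours and statistics (means, ranges, slopes), rather than single values, that correlate with emotional state.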
