EEG Speech Emotion Conversion Demo

Comparison of Speech Emotion Conversion Models Based on EEG Emotional Features (Audio Sample Demonstration)

Abstract

Speech emotion conversion aims to transform the emotional expression of source speech into a target emotion. Recent studies rely mainly on speech-derived emotional representations, while the use of electroencephalography (EEG) signals as emotional conditioning inputs remains relatively underexplored. This work presents an EEG-conditioned speech emotion conversion framework that incorporates EEG-derived emotional representations through cross-modal alignment. To address the modality discrepancy between EEG and speech signals, a three-stage training strategy is adopted: speech-side pretraining, EEG–speech representation alignment, and EEG-conditioned joint optimization. Several EEG emotion encoder architectures are also compared for modeling emotion-related neural representations. Experiments on a synchronized EEG–speech dataset show that explicitly modeling EEG emotional representations provides more consistent emotional guidance than feeding time-aligned EEG features directly into the emotion conversion model. Among the evaluated architectures, the CNN+Transformer encoder achieves the best overall performance, with the lowest average emotion embedding distance (0.1958) and the highest direction consistency (0.4333). Subjective evaluations further indicate improved emotional expressiveness and speech naturalness, with an overall mean opinion score (MOS) of 4.1. These results support the feasibility of using EEG signals as emotional conditioning inputs for speech emotion conversion.
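
The abstract reports two objective metrics, emotion embedding distance and direction consistency, without giving their formulas. The sketch below shows one plausible way such metrics could be computed from emotion embeddings of the source, converted, and target speech. The definitions are assumptions made for illustration: distance is taken as cosine distance between the converted and target embeddings, and direction consistency as the cosine similarity between the achieved shift (converted minus source) and the desired shift (target minus source). The function names and the toy two-dimensional embeddings are hypothetical and may not match the definitions used in this work.

```python
import math


def _dot(a, b):
    # Inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Cosine of the angle between a and b; undefined for zero vectors.
    denom = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return _dot(a, b) / denom


def emotion_embedding_distance(converted, target):
    # ASSUMED definition: cosine distance between the converted-speech
    # embedding and the target-emotion embedding (lower is better).
    return 1.0 - cosine_similarity(converted, target)


def direction_consistency(source, converted, target):
    # ASSUMED definition: cosine similarity between the shift the model
    # achieved and the shift it was asked to make (higher is better).
    achieved = [c - s for c, s in zip(converted, source)]
    desired = [t - s for t, s in zip(target, source)]
    return cosine_similarity(achieved, desired)


if __name__ == "__main__":
    # Toy embeddings, not data from the paper.
    source = [1.0, 0.0]
    target = [0.0, 1.0]
    converted = [0.5, 0.5]
    print(emotion_embedding_distance(converted, target))
    print(direction_consistency(source, converted, target))
```

In this toy case the converted embedding moves exactly along the line from source to target, so the direction consistency is at its maximum while the distance to the target is still nonzero. This illustrates why the two quantities are reported separately: one measures how close the result is to the target, the other whether the change points the right way.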