UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
Published in ICASSP, 2025
The paper introduces UMETTS, a multimodal emotional text-to-speech (E-TTS) framework that leverages emotional cues from text, audio, and visual inputs. The proposed system incorporates an Emotion Prompt Alignment Module (EP-Align) and an Emotion Embedding-Induced TTS Module (EMI-TTS) to generate expressive and emotionally resonant speech.
Recommended citation: Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Junyao Chen, Xiaomao Fan, Xiaojiang Peng, Alexander G. Hauptmann (2025). "UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts." 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Download Paper