CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
CoRR (2024)
Abstract
Recent advancements in zero-shot text-to-speech (TTS) modeling have led to
significant strides in generating high-fidelity and diverse speech. However,
dialogue generation, along with achieving human-like naturalness in speech,
continues to be a challenge in the field. In this paper, we introduce CoVoMix:
Conversational Voice Mixture Generation, a novel model for zero-shot,
human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix
first converts dialogue text into multiple streams of discrete tokens, with
each stream representing the semantic information of an individual talker.
These token streams are then fed into a flow-matching-based acoustic
model to generate mixed mel-spectrograms. Finally, the speech waveforms are
produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of
metrics for measuring the effectiveness of dialogue modeling and generation.
Our experimental results show that CoVoMix can generate dialogues that are not
only human-like in their naturalness and coherence but also involve multiple
talkers engaging in multiple rounds of conversation. These dialogues, generated
within a single channel, are characterized by seamless speech transitions,
including overlapping speech, and appropriate paralinguistic behaviors such as
laughter. Audio samples are available at https://aka.ms/covomix.
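
The three stages described in the abstract can be pictured as a simple data flow. The sketch below is only an illustration of the shapes passed between stages, not the authors' implementation: every function name, token id, and size (`frames_per_word`, `n_mels`, `hop`) is a hypothetical placeholder, and each stage is a toy stand-in for a learned model. Unlike the system described above, this toy never produces overlapping speech.

```python
from typing import Dict, List, Tuple

SILENCE = 0  # hypothetical id for "this talker is silent in this frame"


def text_to_semantic(dialogue: List[Tuple[str, str]],
                     frames_per_word: int = 2) -> Dict[str, List[int]]:
    """Toy stage 1: dialogue text -> one discrete token stream per talker.

    All streams have the same length; a talker who is not speaking
    receives SILENCE tokens for those frames.
    """
    speakers = sorted({spk for spk, _ in dialogue})
    streams: Dict[str, List[int]] = {spk: [] for spk in speakers}
    for spk, utterance in dialogue:
        for word in utterance.split():
            token = 1 + len(word) % 50  # placeholder token id, not a real codebook
            for s in speakers:
                value = token if s == spk else SILENCE
                streams[s].extend([value] * frames_per_word)
    return streams


def acoustic_model(streams: Dict[str, List[int]],
                   n_mels: int = 4) -> List[List[float]]:
    """Toy stage 2: per-talker token streams -> one mixed mel-like matrix.

    A real system would use a flow-matching model here; this stand-in just
    sums the streams frame by frame to show that the output is single-channel.
    """
    n_frames = len(next(iter(streams.values())))
    mel = []
    for t in range(n_frames):
        mixed = float(sum(stream[t] for stream in streams.values()))
        mel.append([mixed] * n_mels)
    return mel


def vocoder(mel: List[List[float]], hop: int = 4) -> List[float]:
    """Toy stage 3: mel frames -> waveform samples (HiFi-GAN in the paper)."""
    wav: List[float] = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wav.extend([level] * hop)  # each frame expands to `hop` samples
    return wav


# Two talkers, two turns.
dialogue = [("A", "hello there"), ("B", "hi")]
streams = text_to_semantic(dialogue)   # {"A": [...], "B": [...]}
mel = acoustic_model(streams)          # frames x n_mels, single mixed channel
wav = vocoder(mel)                     # frames * hop samples
```

The point of the sketch is the interface: stage 1 emits parallel, time-aligned streams (one per talker), while stages 2 and 3 operate on a single mixed channel.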