DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
NeurIPS (2023)
Abstract
In this paper, we introduce self-distillation and online clustering for
self-supervised speech representation learning (DinoSR), which combines masked
language modeling, self-distillation, and online clustering. We show that these
concepts complement each other and result in a strong representation learning
model for speech. DinoSR first extracts contextualized embeddings from the
input audio with a teacher network, then runs an online clustering system on
the embeddings to yield a machine-discovered phone inventory, and finally uses
the discretized tokens to guide a student network. We show that DinoSR
surpasses previous state-of-the-art performance in several downstream tasks,
and provide a detailed analysis of the model and the learned discrete units.
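
The three steps named in the abstract (teacher embeddings, online clustering, discrete targets for the student) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the random arrays stand in for real network outputs, and the nearest-codeword assignment by Euclidean distance, the moving-average codebook update with a `decay` parameter, and the cross-entropy loss on masked frames are all assumptions chosen for illustration. Every function name here is hypothetical.

```python
import numpy as np

def assign_clusters(embeddings, codebook):
    """Return the index of the nearest codeword for each frame embedding.

    Assumption: squared Euclidean distance is the similarity measure.
    """
    # embeddings: (T, D), codebook: (K, D) -> dists: (T, K)
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def update_codebook(codebook, embeddings, assignments, decay=0.9):
    """Online update: move each used codeword toward the mean of its frames.

    Assumption: a simple exponential moving average; unused codewords stay put.
    """
    new_codebook = codebook.copy()
    for k in np.unique(assignments):
        cluster_mean = embeddings[assignments == k].mean(axis=0)
        new_codebook[k] = decay * codebook[k] + (1.0 - decay) * cluster_mean
    return new_codebook

def student_loss(student_logits, targets, mask):
    """Mean cross-entropy between student predictions and cluster indices,
    computed on masked frames only (assumed training objective)."""
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    frame_nll = -log_probs[np.arange(len(targets)), targets]
    return frame_nll[mask].mean()

rng = np.random.default_rng(0)
T, D, K = 50, 8, 4                              # frames, embedding dim, codewords

teacher_embeddings = rng.normal(size=(T, D))    # stand-in for teacher outputs
codebook = rng.normal(size=(K, D))              # stand-in for the unit inventory

# Step 1-2: discretize teacher embeddings, then refresh the codebook online.
targets = assign_clusters(teacher_embeddings, codebook)
codebook = update_codebook(codebook, teacher_embeddings, targets)

# Step 3: the student predicts the discrete targets at masked positions.
mask = rng.random(T) < 0.5
mask[0] = True                                  # guarantee at least one masked frame
student_logits = rng.normal(size=(T, K))        # stand-in for student outputs
loss = student_loss(student_logits, targets, mask)
```

In a real system the teacher and student would be neural networks and the loss would be backpropagated through the student only; the sketch shows just the data flow between the three components.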
Keywords
online clustering, speech, learning, self-distillation, self-supervised