Everybody Rap Now: Coherent Vocals and Whole-body Motion Generation from Text

Jiaben Chen1, Xin Yan2, Yihang Chen3, Siyuan Cen4, Qinwei Ma5, Haoyu Zhen6, Kaizhi Qian7, Lie Lu8, Chuang Gan1,7
1UMass Amherst, 2Wuhan University, 3Zhejiang University, 4Nanjing University, 5Tsinghua University, 6Shanghai Jiao Tong University, 7MIT-IBM Watson AI Lab, 8Dolby Laboratories

Abstract

In this work, we introduce a challenging task: simultaneously generating 3D holistic body motions and singing vocals directly from textual lyric inputs, advancing beyond existing works that typically address these modalities in isolation, at most two at a time (text-to-motion, text-to-audio, or audio-to-motion).

To facilitate this, we first collect RapVerse, a large-scale dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes.

With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens that preserve content, prosodic information, and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs, but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation.
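The core idea above — quantizing continuous motion features into discrete codebook indices so that text, audio, and motion can share one transformer token stream — can be sketched in a few lines. The sketch below is illustrative only: the feature dimensions, vocabulary sizes, and modality-offset scheme are assumptions for exposition, not the system's actual design.

```python
import numpy as np

def quantize_motion(features, codebook):
    """Map continuous motion features (T, D) to discrete token ids by
    nearest-neighbor lookup in a codebook (K, D) — the quantization
    step at the heart of a VQ-VAE motion tokenizer."""
    # squared Euclidean distance from every frame to every codebook entry
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)  # (T,) token ids in [0, K)

# Toy example: a 2-frame "motion" near codebook entries 2 and 0
# (values are random; a real codebook is learned during VQ-VAE training).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 8))                      # K=4 codes, D=8
frames = codebook[[2, 0]] + 0.01 * rng.normal(size=(2, 8))
motion_tokens = quantize_motion(frames, codebook)        # -> [2, 0]

# One way to unify modalities: shift each modality's ids by an offset so
# a single transformer vocabulary covers text, audio, and motion tokens.
TEXT_VOCAB, AUDIO_VOCAB = 1000, 500                      # hypothetical sizes
text_tokens, audio_tokens = [3, 17, 42], [7, 9]
unified = (text_tokens
           + [TEXT_VOCAB + a for a in audio_tokens]
           + [TEXT_VOCAB + AUDIO_VOCAB + int(t) for t in motion_tokens])
```

An autoregressive transformer trained on such interleaved sequences can then emit audio and motion tokens conditioned on the lyric text, with the respective decoders (unit vocoder, motion VQ-VAE decoder) mapping tokens back to waveforms and meshes.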

RapVerse Dataset

RapVerse is a large-scale dataset featuring a comprehensive collection of lyrics, singing vocals, and 3D whole-body motions.

Generation Examples

Given a textual lyric input, we propose a novel task for generating coherent singing vocals and whole-body human motions (including body motions, hand gestures, and facial expressions) simultaneously.

Input Lyric: Ai nothing new under the sun today I just pray to God that I ai gonna be the one today If I shall perish I rest my soul.

Input Lyric: So you know when they keep it and I making love she coming with me through the time hey without it out of.

Input Lyric: In all black and to see I was plotting back when I was seventeen.

Input Lyric: Who chase an objective with mask in a miracle hope the promise lay in one that imperial slapping up how it on top to a mountain where tigers devour your flesh and.

Input Lyric: That tripled up my wealth of futures and generations I been waiting patient for myself and the universe for confirmation.

Input Lyric: All the sacks on the lot that we make Made in brain I could like a hurricane.