MoSa: Motion Generation with Scalable Autoregressive Modeling

Abstract

We introduce MoSa, a novel hierarchical framework for text-driven 3D human motion generation that enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm through a coarse-to-fine scalable generation process. In MoSa, we propose a Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE). MTPS applies interpolation at each hierarchical quantization layer to retain tokens at coarse-to-fine scales. With this, the generative transformer supports Scalable Autoregressive (SAR) modeling, which predicts all tokens of one scale at each step, unlike traditional methods that predict only one token per step. Consequently, MoSa requires only 10 inference steps, matching the number of RQ-VAE quantization layers. To address potential reconstruction degradation from frequent interpolation, we propose CAQ-VAE, a lightweight yet expressive convolution-attention hybrid VQ-VAE. CAQ-VAE enhances the residual block design and incorporates attention mechanisms to better capture global dependencies. Extensive experiments show that MoSa achieves state-of-the-art generation quality and efficiency, outperforming prior methods in both fidelity and speed. On the Motion-X dataset, MoSa achieves an FID of 0.06 (versus MoMask’s 0.20) while reducing inference time by 27%. Moreover, MoSa generalizes well to downstream tasks such as motion editing, requiring no additional fine-tuning.
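To make the idea of interpolation inside a residual quantizer concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single codebook shared across layers, nearest-neighbour lookup, linear interpolation along time, and a hypothetical 10-entry scale schedule; `resize_1d`, `quantize`, `multiscale_residual_quantize`, and `scale_lens` are illustrative names that do not come from the paper.

```python
import numpy as np

def resize_1d(x, new_len):
    """Linearly interpolate a (T, D) sequence along time to (new_len, D)."""
    old_len = x.shape[0]
    if old_len == new_len:
        return x.copy()
    # Use sample centres so that down- and up-sampling are aligned.
    src = (np.arange(old_len) + 0.5) / old_len
    dst = (np.arange(new_len) + 0.5) / new_len
    cols = [np.interp(dst, src, x[:, d]) for d in range(x.shape[1])]
    return np.stack(cols, axis=1)

def quantize(x, codebook):
    """Nearest-neighbour lookup: returns (indices, quantized vectors)."""
    dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)
    return idx, codebook[idx]

def multiscale_residual_quantize(z, codebook, scale_lens):
    """Quantize latent z (T, D) layer by layer, one token map per scale.

    Assumption: each layer downsamples the current residual to its scale,
    quantizes it, upsamples the result, and subtracts it from the residual.
    """
    full_len = z.shape[0]
    residual = z.copy()
    recon = np.zeros_like(z)
    tokens = []
    for length in scale_lens:
        coarse = resize_1d(residual, length)   # residual at this scale
        idx, q = quantize(coarse, codebook)    # `length` tokens for this layer
        up = resize_1d(q, full_len)            # back to full resolution
        recon += up
        residual -= up
        tokens.append(idx)
    return tokens, recon

# Hypothetical setup: 40 latent frames, 8 channels, 64 codes, 10 scales.
rng = np.random.default_rng(0)
z = rng.normal(size=(40, 8))
codebook = rng.normal(size=(64, 8))
scale_lens = [1, 2, 4, 6, 8, 12, 16, 24, 32, 40]
tokens, recon = multiscale_residual_quantize(z, codebook, scale_lens)
print([len(t) for t in tokens])
```

With such a schedule, a generator that emits one whole token map per step needs as many steps as there are layers (10 here), whereas a token-by-token decoder would need one step for each of the 40 finest-scale tokens alone.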

Framework Overview

Application to Motion Editing

Under our scalable autoregressive modeling paradigm, generation at each scale attends bidirectionally to intra-scale context and to the representations of all preceding scales. Leveraging this design, we explore a compelling application of our model, motion editing, which requires no additional training. Motion editing covers several sub-tasks: motion inpainting, outpainting, prefix filling, suffix filling, and free-form motion completion. The results are visualized below, with input motion clips highlighted in pink and generated motions in red.
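The attention pattern described here (bidirectional within a scale, plus access to every preceding scale) can be expressed as a block-wise mask. The sketch below is illustrative only: it assumes the tokens of all scales are concatenated from coarse to fine, and `scale_block_mask` is a hypothetical helper name.

```python
import numpy as np

def scale_block_mask(scale_lens):
    """Boolean (N, N) mask; entry [i, j] is True when query i may attend to key j.

    Tokens are assumed to be laid out scale by scale, coarse to fine.
    A token sees every token of its own scale and of all earlier scales.
    """
    scale_id = np.repeat(np.arange(len(scale_lens)), scale_lens)
    return scale_id[:, None] >= scale_id[None, :]

# Toy example with three scales of 1, 2, and 3 tokens.
mask = scale_block_mask([1, 2, 3])
print(mask.astype(int))
```

Because every token of a scale can see its whole scale, frames that are fixed in advance (as in the editing tasks) remain visible to the frames being generated around them.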

Motion Inpainting (In-betweening)

Generating the middle 50% of the motion from the text "person is working on their boxing form.", conditioned on the first 25% and last 25% of the motion "a person walks forward while being assisted by hand rails."

Source Motion

+ "person is working on their boxing form."

Generating the middle 50% of the motion from the text "a person jogs in place.", conditioned on the first 25% and last 25% of the motion "a person walks half a circle clockwise, then another half circle counter-clockwise."

Source Motion

+ "a person jogs in place."

Motion Outpainting

Generating the first 25% and last 25% of the motion from the text "a person does a jumping jack.", conditioned on the middle 50% of the motion "a person appears to be playing tennins."

Source Motion

+ "a person does a jumping jack."

Generating the first 25% and last 25% of the motion from the text "someone is walking diagonally across the screen", conditioned on the middle 50% of the motion "person is doing kicking motions with left leg."

Source Motion

+ "someone is walking diagonally across the screen"

Prefix Filling

Generating the first 50% of the motion from the text "a person squats down and stands up.", conditioned on the last 50% of the motion "a person pretends to be a dinosuar."

Source Motion

+ "a person squats down and stands up."

Generating the first 50% of the motion from the text "a person is making a high kick with his left leg.", conditioned on the last 50% of the motion "a person acts in a shy way while walking."

Source Motion

+ "a person is making a high kick with his left leg."

Suffix Filling

Generating the last 50% of the motion from the text "a person walking forward in slow motion.", conditioned on the first 50% of the motion "person is walking backwards."

Source Motion

+ "a person walking forward in slow motion."

Generating the last 50% of the motion from the text "a person is sitting down on the ground.", conditioned on the first 50% of the motion "a person appears to be playing the violin."

Source Motion

+ "a person is sitting down on the ground."

Free-form Motion Completion

The motion completion task operates without language conditioning. (Pink=Input, Red=Generation)
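The four text-conditioned sub-tasks above differ only in which frames are kept from the source motion. As a rough illustration under the 25%/50% splits quoted in the captions, the hypothetical helper `keep_mask` below builds a per-frame boolean mask for each task; it is an assumption about how such masks could be laid out, not the authors' code. Free-form completion would correspond to an arbitrary mask supplied by the user.

```python
import numpy as np

def keep_mask(num_frames, task):
    """Return a boolean array: True = frame kept from the source motion,
    False = frame to be generated."""
    quarter = num_frames // 4
    half = num_frames // 2
    keep = np.zeros(num_frames, dtype=bool)
    if task == "inpainting":       # keep first and last 25%, generate the middle
        keep[:quarter] = True
        keep[num_frames - quarter:] = True
    elif task == "outpainting":    # keep the middle 50%, generate both ends
        keep[quarter:num_frames - quarter] = True
    elif task == "prefix":         # generate the first 50%, keep the last 50%
        keep[half:] = True
    elif task == "suffix":         # keep the first 50%, generate the last 50%
        keep[:half] = True
    else:
        raise ValueError(f"unknown task: {task}")
    return keep

for task in ("inpainting", "outpainting", "prefix", "suffix"):
    print(task, keep_mask(8, task).astype(int))
```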

Analysis of the Generation Process

To validate the effectiveness of MoSa's coarse-to-fine generation, we visualize the generation process by presenting the motion at intermediate scales. At the early stages (e.g., Scales 2 and 4), the generated motion exhibits key poses but lacks proper limb coordination. As generation progresses (e.g., Scales 8 and 10), the poses become increasingly natural, with more refined details.