MoSa: Motion Generation with Scalable Autoregressive Modeling

Submitted to TVCG 2025

MoSa generates motion from coarse to fine. The interactive program has been released, and you can adjust the "Scales" slider to step through this process yourself! 🔥🔥🔥

Abstract

We introduce MoSa, a novel hierarchical framework for text-driven 3D human motion generation that enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm through a coarse-to-fine scalable generation process. In MoSa, we propose a Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE). MTPS applies interpolation at each quantization level to retain tokens at scales that vary from coarse to fine. With these multi-scale tokens, the generative transformer supports Scalable Autoregressive (SAR) modeling, which predicts all tokens of one scale at each step, unlike traditional methods that predict only one token per step. Consequently, MoSa requires only 10 inference steps, matching the number of RQ-VAE quantization layers. To address potential reconstruction degradation from frequent interpolation, we propose CAQ-VAE, a lightweight yet expressive convolution-attention hybrid VQ-VAE. CAQ-VAE improves the residual block design and incorporates attention mechanisms to better capture global dependencies. Extensive experiments show that MoSa achieves state-of-the-art generation quality and efficiency, outperforming prior methods in both fidelity and speed. On the Motion-X dataset, MoSa achieves an FID of 0.06 (versus MoMask’s 0.20) while reducing inference time by 27%. Moreover, MoSa generalizes well to downstream tasks such as motion editing without additional fine-tuning.
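The multi-scale residual quantization described in the abstract can be illustrated with a minimal sketch. This is not the MoSa implementation: the linear `resize` helper, the single shared codebook, the random toy features, and the 10-entry `scales` schedule are all assumptions made for illustration. Only the general idea (interpolate at each quantization level, emit one group of tokens per scale) comes from the text above.

```python
import random


def resize(seq, length):
    """Linearly resample a list of feature vectors to `length` frames."""
    n = len(seq)
    out = []
    for i in range(length):
        # Centre-aligned source position, clamped to the valid range.
        pos = min(max((i + 0.5) * n / length - 0.5, 0.0), n - 1.0)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(seq[lo], seq[hi])])
    return out


def nearest(vec, codebook):
    """Index of the codebook entry closest to `vec` (squared L2 distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])))


def multiscale_quantize(features, codebook, scales):
    """Quantize the running residual at each scale, coarse to fine."""
    num_frames, dim = len(features), len(features[0])
    residual = [list(f) for f in features]
    recon = [[0.0] * dim for _ in range(num_frames)]
    tokens = []
    for s in scales:
        coarse = resize(residual, s)                  # interpolate down to s frames
        ids = [nearest(v, codebook) for v in coarse]  # one token per coarse frame
        tokens.append(ids)
        up = resize([codebook[k] for k in ids], num_frames)  # interpolate back up
        for t in range(num_frames):
            for d in range(dim):
                residual[t][d] -= up[t][d]
                recon[t][d] += up[t][d]
    return tokens, recon, residual


random.seed(0)
features = [[random.gauss(0, 1) for _ in range(4)] for _ in range(16)]  # toy latents
codebook = [[random.gauss(0, 1) for _ in range(4)] for _ in range(32)]  # toy codebook
scales = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]  # hypothetical schedule, 10 scales
tokens, recon, residual = multiscale_quantize(features, codebook, scales)
print([len(ids) for ids in tokens])  # number of tokens kept at each scale
```

Under this hypothetical schedule, 10 scale-wise prediction steps cover all 76 tokens, whereas predicting one token per step would take 76 steps.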

Framework Overview


Gallery of Generation

"A man is walking forward then steps over an object then continues walking forward."


"A person steps forward explosively, driving his whole body into a crushing straight punch."


"a person is pushing a shopping cart."


"a person jumps forwards and turns left in mid air."


"a person was pushed but did not fall."


"a man side steps to each side, rotating his body with each side step."


"A person steps forward explosively, driving his whole body into a crushing straight punch."


"a person stands up from a kneeling position, using their right arm to help themselves up."


"a person is making a high kick with his left leg."


"a person jogs in place, slowly at first, then increases speed."


"The person jumps in the air while doing a kick spin to the right."


"A person lifts his left foot above his head and grabs his left foot with his right hand."


"the person is throwing a baseball."


"the figure is appears to be grabbing something for balance at shoulder height with their right hand as they balance with on foot with their left foot and twist their right from side to side, and side to side again."


"A man rises from the ground, walks to the left and sits back down on the ground."


Application to Motion Editing


Thanks to our scalable autoregressive modeling paradigm, generation at each scale can attend bidirectionally to the intra-scale context and to all preceding scale representations. Leveraging this design, we explore a compelling application of our model, motion editing, which requires no additional training. Motion editing covers several sub-tasks: motion inpainting, outpainting, prefix filling, suffix filling, and free-form motion completion. The results are visualized below, with input motion clips highlighted in pink and generated motions in red.
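The four text-conditioned sub-tasks differ only in which frames are kept from the source clip and which are generated. The sketch below shows that frame split, using the 25%/50% ratios from the examples on this page. The `editing_mask` and `merge` helpers are hypothetical names introduced here, and how MoSa applies such a mask across its scales is not described in this section.

```python
def editing_mask(num_frames, task):
    """Return one boolean per frame: True = generate, False = keep from source."""
    quarter = num_frames // 4
    half = num_frames // 2
    if task == "inpainting":   # keep first and last 25%, generate the middle 50%
        return [quarter <= t < num_frames - quarter for t in range(num_frames)]
    if task == "outpainting":  # keep the middle 50%, generate both ends
        return [not (quarter <= t < num_frames - quarter) for t in range(num_frames)]
    if task == "prefix":       # generate the first 50%, keep the last 50%
        return [t < half for t in range(num_frames)]
    if task == "suffix":       # keep the first 50%, generate the last 50%
        return [t >= half for t in range(num_frames)]
    raise ValueError(f"unknown editing task: {task!r}")


def merge(source, generated, mask):
    """Take generated frames where the mask is True, source frames elsewhere."""
    return [g if m else s for s, g, m in zip(source, generated, mask)]


mask = editing_mask(8, "inpainting")
print(mask)
print(merge(list("ssssssss"), list("gggggggg"), mask))
```

Free-form completion would correspond to an arbitrary mask rather than one of these fixed patterns.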


Motion Inpainting (In-betweening)

Generating the middle 50% of the motion from the text "person is working on their boxing form.", conditioned on the first 25% and last 25% of the motion "a person walks forward while being assisted by hand rails."

Source Motion

+ "person is working on their boxing form."


Generating the middle 50% of the motion from the text "a person jogs in place.", conditioned on the first 25% and last 25% of the motion "a person walks half a circle clockwise, then another half circle counter-clockwise."

Source Motion

+ "a person jogs in place."


Motion Outpainting

Generating the first 25% and last 25% of the motion from the text "a person does a jumping jack.", conditioned on the middle 50% of the motion "a person appears to be playing tennins."

Source Motion

+ "a person does a jumping jack."


Generating the first 25% and last 25% of the motion from the text "someone is walking diagonally across the screen", conditioned on the middle 50% of the motion "person is doing kicking motions with left leg."

Source Motion

+ "someone is walking diagonally across the screen"


Prefix filling

Generating the first 50% of the motion from the text "a person squats down and stands up.", conditioned on the last 50% of the motion "a person pretends to be a dinosuar."

Source Motion

+ "a person squats down and stands up."


Generating the first 50% of the motion from the text "a person is making a high kick with his left leg.", conditioned on the last 50% of the motion "a person acts in a shy way while walking."

Source Motion

+ "a person is making a high kick with his left leg."


Suffix filling

Generating the last 50% of the motion from the text "a person walking forward in slow motion.", conditioned on the first 50% of the motion "person is walking backwards."

Source Motion

+ "a person walking forward in slow motion."


Generating the last 50% of the motion from the text "a person is sitting down on the ground.", conditioned on the first 50% of the motion "a person appears to be playing the violin."

Source Motion

+ "a person is sitting down on the ground."


Free-form Motion Completion

The motion completion task operates without language conditioning. (Pink=Input, Red=Generation)

Analysis of the Generation Process


To validate the effectiveness of MoSa's coarse-to-fine generation, we visualize the generation process by showing the motion at intermediate scales. At the early stages (e.g., Scales 2 and 4), the generated motion exhibits key poses but lacks proper limb coordination. As generation progresses (e.g., Scales 8 and 10), the poses become increasingly natural, with more refined details.
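One way to picture an intermediate result is as the sum of the contributions from the scales generated so far. The sketch below assumes, as in residual quantization generally, that each scale's code vectors are upsampled to the full length and added together before decoding. The `partial_latent` and `resize` helpers, the linear interpolation, and the toy codebook are hypothetical, and the decoder that turns the latent into a motion is omitted.

```python
def resize(seq, length):
    """Linearly resample a list of feature vectors to `length` frames."""
    n = len(seq)
    out = []
    for i in range(length):
        # Centre-aligned source position, clamped to the valid range.
        pos = min(max((i + 0.5) * n / length - 0.5, 0.0), n - 1.0)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(seq[lo], seq[hi])])
    return out


def partial_latent(token_maps, codebook, num_frames, upto):
    """Sum the upsampled code vectors of the first `upto` scales."""
    dim = len(codebook[0])
    latent = [[0.0] * dim for _ in range(num_frames)]
    for ids in token_maps[:upto]:
        up = resize([codebook[k] for k in ids], num_frames)
        for t in range(num_frames):
            for d in range(dim):
                latent[t][d] += up[t][d]
    return latent


codebook = [[1.0], [2.0]]   # toy 1-D codebook
token_maps = [[0], [1, 0]]  # tokens for a 1-frame scale and a 2-frame scale
coarse = partial_latent(token_maps, codebook, num_frames=4, upto=1)
finer = partial_latent(token_maps, codebook, num_frames=4, upto=2)
print(coarse)
print(finer)
```

In this toy case the first scale yields a constant latent, and adding the second scale introduces variation over time, which mirrors the qualitative trend described above.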


"a person doing a wushu jump."

Scale 2

Scale 4

Scale 6

Scale 8

Scale 10


"He ran a few steps and jumped into the air."

Scale 2

Scale 4

Scale 6

Scale 8

Scale 10


"taking strides forward, pivot swiftly on the left foot, and then walk the other way."

Scale 2

Scale 4

Scale 6

Scale 8

Scale 10


"The person performs a left knee tuck snap dow."

Scale 2

Scale 4

Scale 6

Scale 8

Scale 10

Comparisons



"The runner slows down, taking deep breaths to recover energy."

T2M-GPT

MLD

MoMask

MoSa (Ours)


"a person doing kung fu pose very fast."

T2M-GPT

MLD

MoMask

MoSa (Ours)


"a person with knees bent, curls up by hunching over, and then stands straight up."

T2M-GPT

MLD

MoMask

MoSa (Ours)


"a person turns around and quickly runs forward before doing kick."

T2M-GPT

MLD

MoMask

MoSa (Ours)


