LS-ReMGM: Latent-aligned Semantic-guided Reaction Motion Generation Model


One person approaches, rotates his/her body to the right, lifts his/her right foot, and kicks the other person's left lower leg. The other person quickly steps back.

Abstract

Modeling virtual characters that can react to the actions of another character or human can benefit automated computer animation, human-robot interaction, and social behavior generation in digital environments. Despite significant progress in motion generation, most existing works focus on generating motion for a single character, while the generation of reaction motion, particularly in the context of human interactions conditioned solely on action sequences, remains largely understudied. For optimal action-reaction motion mapping, the learned latent space must be (1) highly disentangled, meaning similar motions lie closer together and dissimilar motions lie farther apart in the latent space, and (2) closely aligned across the action and reaction motion spaces, so that two variables with the same value sampled from the action and reaction distributions correspond to a correctly mapped action-reaction motion pair.

Effective motion representation is also crucial for models to comprehend the underlying motion structures and offer semantic guidance. Notably, there is a trade-off between representation level and motion semantics: higher-level representations, such as motion class labels or textual descriptions, lack fine-grained motion information, whereas lower-level representations, such as joint locations or orientations, suffer from precision problems and introduce complexity during training. An intermediate representation therefore provides more effective motion semantics. Various schemes have been proposed to represent motion, including pose tokens, motion descriptors, global/local motion cues, and kinematic phrases. Although these schemes excel in motion recognition and classification tasks, they often struggle to generalize to motion generation tasks.

We propose LS-ReMGM, a novel reaction motion generation model based on a dual-encoder CVAE (DE-CVAE), designed to produce semantically aligned human reactions. It comprises two encoders that learn the action and reaction motion spaces independently, with a shared decoder generating the reaction-motion sequence. Our objectives are twofold: (1) to enhance action-reaction mapping by effectively regularizing and aligning the two motion spaces, for which we enforce the two encoders to learn similar probability distributions while disentangling the latent spaces with an enhanced conditional signal; and (2) to provide a better motion representation that captures the nuances of the underlying motion structures, for which we propose novel quantized motion tokens (QMTs) and atomic action vectors as rich intermediate motion representations. LS-ReMGM takes an action-motion sequence $x_a^{1:U}$ and extracts quantized motion tokens and atomic action vectors to form the conditional signal. The two encoders and the decoder use this conditional signal as a bias to disentangle and regularize the motion spaces, resulting in improved action-reaction mapping. The reaction-motion encoder encodes the reaction-motion sequence $x_r^{1:V}$, and the decoder reconstructs the corresponding reaction motion by learning a mapping function. The guided alignment at the two encoders ensures that the two distributions are similar, allowing the decoder to sample a variable from the reaction space during training and from the action space during inference.
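For clarity, the sketch below illustrates this dual-encoder/shared-decoder setup in PyTorch: two encoders produce Gaussian posteriors over the action and reaction latent spaces, a KL-style alignment term pulls the two distributions together, and the shared decoder consumes a sampled latent plus the conditional signal. This is a minimal sketch under assumed module names, dimensions, and temporal pooling, not the model's actual implementation.

```python
import torch
import torch.nn as nn

class DualEncoderCVAE(nn.Module):
    """Minimal DE-CVAE sketch: two encoders over action/reaction motion, one shared decoder.
    All layer sizes and the pooling strategy are illustrative placeholders."""
    def __init__(self, pose_dim=262, latent_dim=512, cond_dim=512):
        super().__init__()
        # Hypothetical encoders: a pooled motion sequence plus the conditional signal
        # (QMTs + atomic action vectors) is mapped to the parameters of a Gaussian posterior.
        self.action_enc = nn.Sequential(nn.Linear(pose_dim + cond_dim, 1024), nn.GELU(),
                                        nn.Linear(1024, 2 * latent_dim))
        self.reaction_enc = nn.Sequential(nn.Linear(pose_dim + cond_dim, 1024), nn.GELU(),
                                          nn.Linear(1024, 2 * latent_dim))
        # Shared decoder: latent variable + conditional signal -> reconstructed reaction pose.
        # A full decoder would generate the temporal sequence x_r^{1:V}; pooled here for brevity.
        self.decoder = nn.Sequential(nn.Linear(latent_dim + cond_dim, 1024), nn.GELU(),
                                     nn.Linear(1024, pose_dim))

    @staticmethod
    def _posterior(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu, logvar

    def forward(self, x_a, x_r, cond):
        # x_a: [B, U, pose_dim] action motion, x_r: [B, V, pose_dim] reaction motion,
        # cond: [B, cond_dim] conditional signal from the QMTE/AAE modules.
        a_feat = torch.cat([x_a.mean(dim=1), cond], dim=-1)
        r_feat = torch.cat([x_r.mean(dim=1), cond], dim=-1)
        mu_a, logvar_a = self._posterior(self.action_enc(a_feat))
        mu_r, logvar_r = self._posterior(self.reaction_enc(r_feat))

        # Training-time sampling from the reaction posterior (reparameterization trick).
        z = mu_r + torch.randn_like(mu_r) * (0.5 * logvar_r).exp()
        recon = self.decoder(torch.cat([z, cond], dim=-1))

        # Alignment term: KL between the reaction and action posteriors, encouraging the
        # two latent spaces to match so a latent from either space decodes consistently.
        align = 0.5 * ((logvar_r - logvar_a).exp()
                       + (mu_r - mu_a).pow(2) / logvar_a.exp()
                       - 1.0 + logvar_a - logvar_r).sum(dim=-1).mean()
        return recon, align
```

In this simplified view, the latent is sampled from the reaction posterior during training; at inference the reaction encoder is unused and the latent is drawn from the action posterior instead, which the alignment term makes interchangeable.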

Proposed Method



Overview of the proposed LS-ReMGM model. (left) DE-CVAE network with two encoders and a decoder; QMTs and atomic action vectors are extracted from the action motion using the QMTE and AAE modules, respectively. (right) QMTE module, AAE module, and atomic action codebook.



Quantized Motion Tokens (QMTs)

Visualization of quantized motion tokens. (top left) Motion sequence; (top middle) orientational and positional quantizations; (top right) extracted quantized motion tokens; (bottom) visual representations for QPT, QPRPT, QPDT, QLAT, QLOT, and QJVT.
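The exact QMT construction is described in the paper; as a rough illustration only, the sketch below shows one way discrete positional and orientational tokens could be derived by uniformly binning per-joint positions and rotations. The function name, bin count, and input layout are assumptions, not the paper's implementation.

```python
import math
import torch

def quantize_motion_tokens(joint_pos, joint_rot, num_bins=64):
    """Hypothetical QMT extraction sketch: map continuous per-frame positional and
    orientational features to discrete token IDs by uniform binning.
    joint_pos: [T, J, 3] joint positions; joint_rot: [T, J, 3] joint rotations (radians).
    Returns integer token grids of shape [T, J, 3] for positions and orientations."""
    # Normalize positions to [0, 1] per axis over the sequence, then bucketize.
    p_min = joint_pos.amin(dim=(0, 1), keepdim=True)
    p_max = joint_pos.amax(dim=(0, 1), keepdim=True)
    pos_norm = (joint_pos - p_min) / (p_max - p_min).clamp_min(1e-6)
    pos_tokens = (pos_norm * (num_bins - 1)).round().long()

    # Wrap rotation angles to [0, 2*pi) and bucketize into angular bins.
    rot_norm = torch.remainder(joint_rot, 2 * math.pi) / (2 * math.pi)
    rot_tokens = (rot_norm * (num_bins - 1)).round().long()
    return pos_tokens, rot_tokens
```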





Qualitative Results on InterX Dataset (Rendering Engine = Open 3D Engine-O3DE)

Prompt: The first person extends his/her left hand and pats the right side of the second person's face.
Prompt: The first person touches the second person on the shoulder, and then they walk while the other person rests his/her right hand on the first person's shoulder.
Prompt: The first person sits on the chair, and the second person helps him up by grabbing his left arm with both hands.
Prompt: The first person stands behind the second person, raises their right hand, and waves. The second person turns counterclockwise to look back at the first person.
Prompt: Two people face each other, raise their right hands, and wave their hands above their heads.
Prompt: Two people stand side by side, raising both hands and waving their hands up and down in front of their chests while jumping.
Prompt: Two people walk side by side and one person extends his/her right foot to trip the other person's left foot, causing him/her to fall.
Prompt: One person bends his/her right elbow and rests it on the other person's left shoulder, and they walk forward.
Prompt: One person supports the other person's right hand with his/her left hand while walking forward together, with the former on the left side of the latter.






Qualitative Results on InterHuman Dataset (Rendering Engine = Blender-Cycles)

(No Prompt) Action-Reaction.
(No Prompt) Action-Reaction.
(No Prompt) Action-Reaction.








Evaluation and Comparison with SOTA


Qualitative Evaluations against SOTA

Qualitative comparison of LS-ReMGM on the InterHuman dataset with ReGenNet and ReMoS for sequence 1. For the input action motion, the corresponding generated reaction and the combined motion are shown. Character a (in blue) is the actor and r (in red) is the reactor.




Qualitative comparison of LS-ReMGM on InterHuman dataset with ReGenNet and ReMoS for sequence 2.





Quantitative Evaluation on InterHuman Dataset

Quantitative comparison of LS-ReMGM with state-of-the-art approaches on the InterHuman test set. Values are reported with 95% confidence intervals (±). Arrows indicate evaluation preference: (↑) for higher-is-better, (↓) for lower-is-better, and (→) for values closest to ground truth. Bold indicates the best performance, while underlining denotes the second-best.
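As a small illustration of the reporting convention (not the paper's evaluation code), the snippet below computes a metric's mean and a normal-approximation 95% confidence interval over repeated evaluation runs, which is the assumed protocol behind the "± " values in the tables.

```python
import numpy as np

def mean_with_95ci(metric_runs):
    """Report a metric as (mean, 95% CI half-width) over repeated evaluation runs."""
    runs = np.asarray(metric_runs, dtype=np.float64)
    mean = runs.mean()
    # Normal-approximation interval over the replications (assumed protocol).
    ci95 = 1.96 * runs.std(ddof=1) / np.sqrt(len(runs))
    return mean, ci95

# Example usage: 20 repeated FID evaluations of the same model.
# mean, ci = mean_with_95ci(fid_runs); print(f"{mean:.3f} (± {ci:.3f})")
```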





Quantitative Evaluation on InterX Dataset

Quantitative comparison of LS-ReMGM with state-of-the-art approaches on the InterX test set.





Ablation Studies

Component-wise ablation study evaluating the impact of removing key components from the LS-ReMGM model. Results are reported for FID, MPJPE, MPJVE, and M-Cons.
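For reference, the standard definitions of the joint-error metrics can be sketched as follows: MPJPE averages the per-joint Euclidean position error, and MPJVE applies the same error to frame-to-frame velocities as a proxy for temporal smoothness. The exact alignment, units, and any protocol details follow the paper; this is only a generic sketch.

```python
import torch

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth joint positions. pred, gt: [T, J, 3]."""
    return (pred - gt).norm(dim=-1).mean()

def mpjve(pred, gt):
    """Mean Per-Joint Velocity Error: the same error computed on frame-to-frame
    joint velocities, reflecting temporal smoothness. pred, gt: [T, J, 3]."""
    v_pred = pred[1:] - pred[:-1]
    v_gt = gt[1:] - gt[:-1]
    return (v_pred - v_gt).norm(dim=-1).mean()
```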

Ablation study investigating the effects of varying the latent vector dimensions $q_p^i$ and $q_r^i$ and the number of attention heads $A$ on LS-ReMGM performance. Optimal results are obtained with $q_p^i = q_r^i = 512$ and $A = 128$, balancing fidelity, joint accuracy, temporal smoothness, and interaction consistency.