Modeling virtual characters that can react to the actions of another character or human benefits automated computer animation, human-robot interaction, and social behavior generation in digital environments. Despite significant progress in motion generation, most existing works focus on generating motion for a single character, while reaction motion generation, particularly for human interactions conditioned solely on action sequences, remains largely understudied.

For an optimal action-reaction motion mapping, the learned latent space must be (1) highly disentangled, so that similar motions lie close together and dissimilar motions lie far apart, and (2) closely aligned across the action and reaction motion spaces, so that latent variables with the same value sampled from the action and reaction distributions correspond to a correctly mapped action-reaction motion pair.

In addition, effective motion representation is crucial for models to comprehend the underlying motion structure and provide semantic guidance. There is a trade-off between the representation level and motion semantics: higher-level representations, such as motion class labels or textual descriptions, lack fine-grained motion information, whereas lower-level representations, such as joint locations or orientations, suffer from precision problems and add complexity during training. An intermediate representation therefore provides more effective motion semantics. Various schemes have been proposed to represent motion, including pose tokens, motion descriptors, global/local motion cues, and kinematic phrases. Although these schemes excel in motion recognition and classification tasks, they often fail to generalize well to motion generation.

We propose LS-ReMGM, a novel reaction motion generation model based on a dual-encoder CVAE (DE-CVAE), designed to produce semantically aligned human reactions. It comprises two encoders that learn the action and reaction motion spaces independently and a shared decoder that generates the reaction-motion sequence. Our objectives are twofold: (1) to enhance the action-reaction mapping by effectively regularizing and aligning the two motion spaces, for which we enforce the two encoders to learn similar probability distributions while disentangling the latent spaces with an enhanced conditional signal; and (2) to provide a richer motion representation that captures the nuances of the underlying motion structure, for which we propose novel quantized motion tokens (QMTs) and atomic action vectors as rich intermediate motion representations.

LS-ReMGM takes an action-motion sequence $x_a^{1:U}$ and extracts quantized motion tokens and atomic action vectors to form the conditional signal. The two encoders and the decoder use this signal as a bias to disentangle and regularize the motion spaces, which improves the action-reaction mapping. The reaction-motion encoder encodes the reaction-motion sequence $x_r^{1:V}$, and the decoder reconstructs the corresponding reaction motion by learning a mapping function. Moreover, the guided alignment at the two encoders ensures that the two distributions are similar, allowing the decoder to sample a latent variable from the reaction space during training and from the action space during inference.
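To make the dual-encoder design concrete, the following is a minimal PyTorch sketch of a DE-CVAE with a shared conditional bias and a symmetric-KL alignment term between the action and reaction posteriors. The module names, layer sizes, and the specific alignment loss are illustrative assumptions for exposition, not the published LS-ReMGM implementation.

```python
# Illustrative DE-CVAE sketch (assumed design, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEncoder(nn.Module):
    """Encodes a motion sequence, biased by the conditional signal, into a Gaussian."""
    def __init__(self, pose_dim, cond_dim, hidden_dim=256, latent_dim=64):
        super().__init__()
        self.rnn = nn.GRU(pose_dim + cond_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, motion, cond):
        # motion: (B, T, pose_dim); cond: (B, cond_dim), broadcast over time
        cond_seq = cond.unsqueeze(1).expand(-1, motion.size(1), -1)
        _, h = self.rnn(torch.cat([motion, cond_seq], dim=-1))
        return self.to_mu(h[-1]), self.to_logvar(h[-1])

class ReactionDecoder(nn.Module):
    """Maps a latent sample plus the conditional signal to a reaction-motion sequence."""
    def __init__(self, pose_dim, cond_dim, latent_dim=64, hidden_dim=256):
        super().__init__()
        self.fc_init = nn.Linear(latent_dim + cond_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, z, cond, length):
        h = self.fc_init(torch.cat([z, cond], dim=-1))
        steps = h.unsqueeze(1).expand(-1, length, -1).contiguous()
        y, _ = self.rnn(steps)
        return self.out(y)

def symmetric_kl(mu_a, logvar_a, mu_r, logvar_r):
    # Symmetric KL between two diagonal Gaussians: pulls the action and
    # reaction posteriors toward each other so the two latent spaces align.
    var_a, var_r = logvar_a.exp(), logvar_r.exp()
    kl_ar = 0.5 * (logvar_r - logvar_a + (var_a + (mu_a - mu_r) ** 2) / var_r - 1)
    kl_ra = 0.5 * (logvar_a - logvar_r + (var_r + (mu_r - mu_a) ** 2) / var_a - 1)
    return (kl_ar + kl_ra).sum(dim=-1).mean()

def training_step(enc_a, enc_r, dec, x_action, x_reaction, cond):
    # Both encoders receive the same conditional signal (QMTs + atomic action vectors).
    mu_a, logvar_a = enc_a(x_action, cond)
    mu_r, logvar_r = enc_r(x_reaction, cond)
    # Training: sample z from the reaction posterior via the reparameterization trick.
    # Inference would instead sample from the action posterior, which the
    # alignment term makes interchangeable with the reaction posterior.
    z = mu_r + torch.randn_like(mu_r) * (0.5 * logvar_r).exp()
    recon = dec(z, cond, x_reaction.size(1))
    return F.mse_loss(recon, x_reaction) + symmetric_kl(mu_a, logvar_a, mu_r, logvar_r)
```

In this sketch the alignment term is what licenses swapping the sampling distribution at inference time: because the action and reaction posteriors are driven toward each other during training, a latent drawn from the action encoder can be decoded by the same shared decoder.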
Overview of the proposed LS-ReMGM model. (Left) DE-CVAE network with two encoders and a decoder; QMTs and atomic action vectors are extracted from the action motion using the QMTE and AAE modules, respectively. (Right) QMTE module, AAE module, and atomic action codebook.
Visualization of quantized motion tokens. (Top left) Motion sequence; (top middle) orientational and positional quantizations; (top right) extracted quantized motion tokens; (bottom) visual representations of QPT, QPRPT, QPDT, QLAT, QLOT, and QJVT.
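As a toy illustration of how per-frame motion features might be snapped to a discrete codebook to obtain quantized motion tokens, the sketch below performs a nearest-neighbour codebook lookup. The feature dimension, codebook size, and the lookup itself are placeholders; the actual QMTE module and codebook learning in the paper may differ.

```python
# Toy quantization sketch (assumed design, not the published QMTE module).
import torch

def quantize_motion(features, codebook):
    """features: (T, D) per-frame motion descriptors; codebook: (K, D) entries.

    Returns (T,) token indices and the (T, D) quantized features.
    """
    # Euclidean distance between every frame feature and every codebook entry
    dists = torch.cdist(features, codebook)   # (T, K)
    tokens = dists.argmin(dim=-1)             # (T,) discrete token ids
    quantized = codebook[tokens]              # (T, D) features snapped to the codebook
    return tokens, quantized

# Example usage with random placeholder data
torch.manual_seed(0)
frame_features = torch.randn(120, 32)   # 120 frames, 32-d descriptors (assumed)
codebook = torch.randn(64, 32)          # 64 codebook entries (assumed)
tokens, quantized = quantize_motion(frame_features, codebook)
print(tokens.shape, quantized.shape)    # torch.Size([120]) torch.Size([120, 32])
```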