
[NeurIPS 2023] FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing





FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing

¹S-Lab, Nanyang Technological University  ²SenseTime Research
+corresponding author


Abstract: Text-driven motion generation has achieved substantial progress with the emergence of diffusion models. However, existing methods still struggle to generate complex motion sequences that correspond to fine-grained descriptions, depicting detailed and accurate spatio-temporal actions. This lack of fine controllability limits the usage of motion generation to a larger audience. To tackle these challenges, we present FineMoGen, a diffusion-based motion generation and editing framework that can synthesize fine-grained motions, with spatial-temporal composition to the user instructions. To facilitate a large-scale study on this new fine-grained motion generation task, we also contribute the HuMMan-MoGen dataset, which contains fine-grained descriptions for each body part and each action stage.

Pipeline Overview: FineMoGen builds upon a diffusion model with a novel transformer architecture dubbed Spatio-Temporal Mixture Attention (SAMI). SAMI optimizes the generation of the global attention template from three perspectives: 1) Temporal Independence: we regard each global template as a time-varied signal, which allows us to extrapolate the feature refinement between different time intervals. 2) Spatial Independence: we manually divide the raw motion data into several body parts, process them independently in FFN modules, and apply spatial refinement in SAMI modules. 3) Sparsely-Activated Mixture-of-Experts: we broaden the overall network structure to enhance learning capability and overcome the training difficulties brought by spatial-temporal independence modelling.
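The spatial-independence idea can be illustrated with a small NumPy sketch. It is not the FineMoGen implementation: the 12-dimensional feature layout, the part names in `BODY_PARTS`, and the helper functions are all hypothetical, chosen only to show each body part passing through its own feed-forward network without mixing with the others.

```python
import numpy as np

# Hypothetical split of a 12-dim per-frame feature into body parts.
# The real feature layout and part names in FineMoGen may differ.
BODY_PARTS = {
    "head": slice(0, 2),
    "torso": slice(2, 6),
    "arms": slice(6, 9),
    "legs": slice(9, 12),
}


def init_ffn(rng, dim, hidden):
    """Create weights for one two-layer feed-forward network."""
    return {
        "w1": rng.standard_normal((dim, hidden)) * 0.1,
        "w2": rng.standard_normal((hidden, dim)) * 0.1,
    }


def ffn(params, x):
    """Apply Linear -> ReLU -> Linear with a residual connection."""
    return x + np.maximum(x @ params["w1"], 0.0) @ params["w2"]


def part_wise_ffn(motion, part_params):
    """Process each body part with its own FFN; parts never mix here."""
    out = np.empty_like(motion)
    for name, cols in BODY_PARTS.items():
        out[:, cols] = ffn(part_params[name], motion[:, cols])
    return out


rng = np.random.default_rng(0)
part_params = {
    name: init_ffn(rng, cols.stop - cols.start, hidden=8)
    for name, cols in BODY_PARTS.items()
}
motion = rng.standard_normal((60, 12))  # 60 frames, 12 features per frame
refined = part_wise_ffn(motion, part_params)
print(refined.shape)
```

Because every part has separate weights and reads only its own columns, editing the leg features leaves the output for the other parts unchanged. In the description above, interaction across parts is left to the spatial refinement in the SAMI modules.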


[12/2023] Release code for FineMoGen, MoMat-MoGen, ReMoDiffuse and MotionDiffuse

Benchmark and Model Zoo

Supported methods


If you find our work useful for your research, please consider citing the paper:

  title={FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing},
  author={Zhang, Mingyuan and Li, Huirong and Cai, Zhongang and Ren, Jiawei and Yang, Lei and Liu, Ziwei},
  title={ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model},
  author={Zhang, Mingyuan and Guo, Xinying and Pan, Liang and Cai, Zhongang and Hong, Fangzhou and Yang, Lei and Liu, Ziwei},
  journal={arXiv preprint arXiv:2304.01116},
  title={MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model},
  author={Zhang, Mingyuan and Cai, Zhongang and Pan, Liang and Hong, Fangzhou and Guo, Xinying and Yang, Lei and Liu, Ziwei},
  journal={arXiv preprint arXiv:2208.15001},


# Create Conda Environment
conda create -n mogen python=3.9 -y
conda activate mogen

# C++ Environment
export PATH=/mnt/lustre/share/gcc/gcc-8.5.0/bin:$PATH
export LD_LIBRARY_PATH=/mnt/lustre/share/gcc/gcc-8.5.0/lib:/mnt/lustre/share/gcc/gcc-8.5.0/lib64:/mnt/lustre/share/gcc/gmp-4.3.2/lib:/mnt/lustre/share/gcc/mpc-0.8.1/lib:/mnt/lustre/share/gcc/mpfr-2.4.2/lib:$LD_LIBRARY_PATH

# Install Pytorch
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch -y

# Install MMCV
pip install "mmcv-full>=1.4.2,<=1.9.0" -f

# Install Pytorch3d
conda install -c bottler nvidiacub -y
conda install -c fvcore -c iopath -c conda-forge fvcore iopath -y
conda install pytorch3d -c pytorch3d -y

# Install tutel
python3 -m pip install --verbose --upgrade git+

# Install other requirements
pip install -r requirements.txt

Data Preparation

Download the data files from the Google Drive link. Unzip all files and arrange them in the following file structure:

├── mogen
├── tools
├── configs
├── logs
│   ├── finemogen
│   ├── motiondiffuse
│   ├── remodiffuse
│   └── mdm
└── data
    ├── database
    ├── datasets
    ├── evaluators
    └── glove
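
Before training, it can help to confirm that the layout matches the tree above. The following is a minimal sketch: the directory names come from the tree, while the `missing_dirs` helper is hypothetical and not part of the repository.

```python
from pathlib import Path

# Directory names taken from the tree above; the helper itself is hypothetical.
EXPECTED_DIRS = [
    "mogen",
    "tools",
    "configs",
    "logs/finemogen",
    "logs/motiondiffuse",
    "logs/remodiffuse",
    "logs/mdm",
    "data/database",
    "data/datasets",
    "data/evaluators",
    "data/glove",
]


def missing_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Directory layout looks complete.")
```

Run it from the repository root; an empty result means every directory listed in the tree is present.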


Training with a single / multiple GPUs

PYTHONPATH=".":$PYTHONPATH python tools/ ${CONFIG_FILE} ${WORK_DIR} --no-validate

Note: The provided config files are designed for training with 8 GPUs. To train on a single GPU, reduce the number of epochs to one-fourth of the original value.
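
A rough worked example of the one-fourth rule follows. It assumes the per-GPU batch size stays unchanged, so a single GPU needs 8 times as many iterations to cover the dataset once; this assumption, the helper name, and the numbers (40 epochs, 100 iterations per epoch) are illustrative only and are not taken from the configs.

```python
def single_gpu_schedule(epochs_8gpu, iters_per_epoch_8gpu, num_gpus=8):
    """Estimate the single-GPU schedule implied by the one-fourth rule."""
    # Assumption: per-GPU batch size is unchanged, so one GPU needs
    # num_gpus times more iterations to pass over the dataset once.
    iters_per_epoch_1gpu = iters_per_epoch_8gpu * num_gpus
    # "One-fourth of the original" epochs, as stated in the note.
    epochs_1gpu = max(1, epochs_8gpu // 4)
    return {
        "epochs": epochs_1gpu,
        "iters_per_epoch": iters_per_epoch_1gpu,
        "total_iters": epochs_1gpu * iters_per_epoch_1gpu,
        "total_iters_8gpu": epochs_8gpu * iters_per_epoch_8gpu,
    }


# Hypothetical numbers: 40 epochs of 100 iterations each on 8 GPUs.
schedule = single_gpu_schedule(epochs_8gpu=40, iters_per_epoch_8gpu=100)
print(schedule)
```

Under these assumptions, the single-GPU run performs 10 epochs of 800 iterations, which is twice as many parameter updates as the 8-GPU run, each on a batch one-eighth the size.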

Training with Slurm

./tools/ ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM} --no-validate

Common optional arguments include:

  • --resume-from ${CHECKPOINT_FILE}: Resume from a previous checkpoint file.
  • --no-validate: Skip evaluating checkpoints during training.

Example: using 8 GPUs to train FineMoGen on a Slurm cluster.

./tools/ my_partition my_job configs/finemogen/ logs/finemogen_kit 8 --no-validate


Evaluate with a single GPU / multiple GPUs

PYTHONPATH=".":$PYTHONPATH python tools/ ${CONFIG} --work-dir=${WORK_DIR} ${CHECKPOINT}

Evaluate with slurm



./tools/ my_partition test_finemogen configs/finemogen/ logs/finemogen/finemogen_kit logs/finemogen/finemogen_kit/latest.pth

Note: Running the full evaluation on the HumanML3D dataset is very slow. You can set replication_times to 1 for a quick evaluation.


Visualization for a single motion

    --text ${TEXT} \
    --motion_length ${MOTION_LENGTH} \
    --device cpu


PYTHONPATH=".":$PYTHONPATH python tools/ \
    configs/finemogen/ \
    logs/finemogen/finemogen_t2m/latest.pth \
    --text "a person is running quickly" \
    --motion_length 120 \
    --out "test.gif" \
    --device cpu

Visualization for temporal composition

    --text ${TEXT1} ${TEXT2} ${TEXT3} ... \
    --motion_length ${MOTION_LENGTH1} ${MOTION_LENGTH2} ${MOTION_LENGTH3} ... \
    --device cpu


PYTHONPATH=".":$PYTHONPATH python tools/ \
    configs/finemogen/ \
    logs/finemogen/finemogen_t2m/latest.pth \
    --text "a person walks 4 steps forward" "a person stops and looks around" "a person sits down" "a person cries" "a person jumps backward"  \
    --motion_length 60 60 60 60 60 \
    --out "test.gif"
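
To see how the per-segment lengths add up, the sketch below maps each text prompt to a frame range. It assumes segments are concatenated in order without overlap; the visualization tool may handle transitions differently, and `segment_ranges` is a hypothetical helper, not part of the repository.

```python
from itertools import accumulate


def segment_ranges(texts, motion_lengths):
    """Map each prompt to a half-open frame range [start, end)."""
    if len(texts) != len(motion_lengths):
        raise ValueError("each text needs exactly one motion length")
    ends = list(accumulate(motion_lengths))
    starts = [0] + ends[:-1]
    return list(zip(texts, starts, ends))


texts = [
    "a person walks 4 steps forward",
    "a person stops and looks around",
    "a person sits down",
    "a person cries",
    "a person jumps backward",
]
ranges = segment_ranges(texts, [60, 60, 60, 60, 60])
for text, start, end in ranges:
    print(f"frames {start:3d}-{end - 1:3d}: {text}")
```

With five segments of 60 frames each, the composed sequence spans 300 frames, and the third prompt covers frames 120 through 179.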


This study is supported by the Ministry of Education, Singapore, under its MOE AcRF Tier 2 (MOE-T2EP20221-0012), NTU NAP, and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

The visualization tool is developed on top of Generating Diverse and Natural 3D Human Motions from Text

