HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation
Paper Notes / Visual Generation, 2025. 3. 18. 22:23
This work introduces HumanDiT, a pose-guided, DiT-based framework.
HumanDiT:
1. supports diverse video resolutions and variable sequence lengths.
2. uses a prefix-latent reference strategy to preserve personalized characteristics.
3. uses a pose adapter to perform pose transfer.
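To make items 1 and 2 more concrete, here is a minimal sketch of how a reference latent could be placed as a prefix in front of the video tokens, with the token count simply following the input resolution and length. It is only an illustration under stated assumptions, not the paper's implementation: the function names `patchify` and `build_sequence`, the patch size of 2, and the latent shapes are all hypothetical.

```python
import numpy as np


def patchify(latent, patch=2):
    """Split a latent of shape (T, C, H, W) into a flat token sequence.

    Returns an array of shape (T * H/patch * W/patch, C * patch * patch).
    The patch size is an assumed value for illustration.
    """
    t, c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    x = latent.reshape(t, c, h // patch, patch, w // patch, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # (T, H/p, W/p, C, p, p)
    return x.reshape(t * (h // patch) * (w // patch), c * patch * patch)


def build_sequence(ref_latent, video_latent, patch=2):
    """Prepend reference-image tokens to the video tokens (hypothetical helper).

    ref_latent:   (C, H, W) latent of the reference image
    video_latent: (T, C, H, W) noisy latent of the video being generated
    """
    ref_tokens = patchify(ref_latent[None], patch)  # treat the reference as one frame
    vid_tokens = patchify(video_latent, patch)
    tokens = np.concatenate([ref_tokens, vid_tokens], axis=0)
    # Mark which tokens belong to the (clean) reference prefix.
    is_prefix = np.zeros(len(tokens), dtype=bool)
    is_prefix[: len(ref_tokens)] = True
    return tokens, is_prefix


rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 8, 6))       # assumed latent: 4 channels, 8x6
video = rng.normal(size=(5, 4, 8, 6))  # 5 latent frames at the same resolution
tokens, is_prefix = build_sequence(ref, video)
print(tokens.shape, int(is_prefix.sum()))
```

With these shapes each frame yields 12 tokens, so the sequence has 12 prefix tokens followed by 60 video tokens. Changing the resolution or the number of frames only changes the sequence length, which is the property a transformer backbone can absorb without a fixed input size.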
Introduction
Limitations of existing methods
1. weak temporal consistency in long-sequence generation
2. limited ability to generalize across varied scenarios
3. fixed-resolution input
4. designed for pose transfer with a given pose sequence, so any misalignment in pose can result in visual artifacts
HumanDiT
- adaptable pose-guided body animation framework
- designed for diverse resolutions and long-form video generation
1. built on DiT to support variable resolution and dynamic sequence length
- prefix-latent reference strategy: maintains visual consistency across inputs while accommodating diverse resolutions and durations
- pose guider: captures temporal and spatial features via patch-based extraction
2. large-scale, diverse dataset: 14,000 hours of in-the-wild videos
- data processing pipeline: extraction and a filtering strategy with scoring models, especially for detail-rich hands and teeth
3. expert Keypoint-DiT for pose generation
Related Works
Methodology
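The score-based filtering step of the data processing pipeline noted in the Introduction can be pictured roughly as follows. This is a sketch under assumptions, not the paper's pipeline: the score names, the `ClipScores` type, and the threshold values are all hypothetical, and the scoring models themselves are not shown.

```python
from dataclasses import dataclass


@dataclass
class ClipScores:
    clip_id: str
    overall: float  # hypothetical whole-frame quality score in [0, 1]
    hands: float    # hypothetical hand-detail score in [0, 1]
    teeth: float    # hypothetical teeth-detail score in [0, 1]


# Assumed thresholds; detail-rich regions get a stricter cut-off here.
THRESHOLDS = {"overall": 0.5, "hands": 0.7, "teeth": 0.7}


def keep_clip(scores, thresholds=THRESHOLDS):
    """Keep a clip only if every score meets its threshold."""
    return all(getattr(scores, name) >= cut for name, cut in thresholds.items())


def filter_clips(clips, thresholds=THRESHOLDS):
    """Return the clips that pass all thresholds, preserving order."""
    return [c for c in clips if keep_clip(c, thresholds)]


clips = [
    ClipScores("a", overall=0.9, hands=0.8, teeth=0.9),
    ClipScores("b", overall=0.9, hands=0.4, teeth=0.9),  # blurry hands
    ClipScores("c", overall=0.3, hands=0.9, teeth=0.9),  # low overall quality
]
kept = filter_clips(clips)
print([c.clip_id for c in kept])
```

Only clip "a" passes all three thresholds in this toy data. Requiring every score to pass, rather than averaging them, means that a clip with good overall quality but poor hands is still rejected.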
Experiments