Figure 1: Overall pipeline of DreamRunner. (1) Plan generation stage: we employ an LLM to craft a hierarchical video plan (i.e., a "High-Level Plan" and a "Fine-Grained Plan") from a user-provided generic story narration. (2.1) Motion retrieval and prior learning stage: we retrieve videos relevant to the desired motions from a video database and learn the motion prior through test-time fine-tuning. (2.2) Subject prior learning stage: we use reference images to learn the subject prior through test-time fine-tuning. (3) Video generation with region-based diffusion stage: we equip the diffusion model with a novel spatial-temporal region-based 3D attention and prior injection module (i.e., SR3AI) for video generation with fine-grained control.
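To make the stage ordering concrete, the sketch below walks through the four stages of Figure 1 in plain Python. Every name in it (the methods on llm, video_db, and base_model, as well as the plan/scene attributes) is a hypothetical placeholder standing in for DreamRunner's actual components, not its real interface.

# Hypothetical end-to-end sketch of the Figure 1 pipeline; every method and
# attribute below is an illustrative placeholder, not DreamRunner's real API.
def dreamrunner_pipeline(story, reference_images, video_db, llm, base_model):
    # (1) Plan generation: the LLM turns the story narration into a
    #     hierarchical plan (coarse scene plan + fine-grained object-level
    #     layouts and motions).
    high_level_plan = llm.generate_high_level_plan(story)
    fine_grained_plan = llm.generate_fine_grained_plan(high_level_plan)

    # (2.1) Motion prior learning: retrieve clips matching each planned motion
    #       and fit a motion LoRA on them through test-time fine-tuning.
    motion_loras = {
        m.name: base_model.finetune_lora(video_db.retrieve(m.description))
        for m in fine_grained_plan.motions
    }

    # (2.2) Subject prior learning: fit a character LoRA for each subject from
    #       its reference images, also through test-time fine-tuning.
    subject_loras = {
        s.name: base_model.finetune_lora(reference_images[s.name])
        for s in fine_grained_plan.subjects
    }

    # (3) Region-based diffusion with SR3AI: generate each scene under
    #     region-specific prompts/masks with the learned priors injected.
    return [
        base_model.generate(
            region_prompts=scene.region_prompts,
            region_masks=scene.region_masks,
            loras={**subject_loras, **motion_loras},
        )
        for scene in fine_grained_plan.scenes
    ]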
Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene.
To address these challenges, we propose DreamRunner, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout and motion planning. Next, DreamRunner performs retrieval-augmented test-time adaptation to capture target motion priors for the objects in each scene, supporting diverse motion customization based on retrieved videos and thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module, SR3AI, for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-CompBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions through qualitative examples.
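As a rough illustration of the two-level structure produced in the first step, a hierarchical plan for a two-scene story might be parsed into something like the Python dictionary below; the schema, field names, and values are invented for this example and are not the paper's actual plan format.

# Toy example of a hierarchical video plan; schema and values are invented.
story_plan = {
    "high_level_plan": [
        {"scene": 1, "description": "A corgi wakes up in a sunlit bedroom."},
        {"scene": 2, "description": "The corgi chases a ball through a park."},
    ],
    "fine_grained_plan": {
        1: [
            {"subject": "corgi",
             "layout": [0.10, 0.40, 0.50, 0.90],   # normalized box: x1, y1, x2, y2
             "motions": ["stretches", "jumps off the bed"]},
        ],
        2: [
            {"subject": "corgi",
             "layout": [0.00, 0.50, 0.40, 1.00],
             "motions": ["runs", "catches the ball"]},
            {"subject": "ball",
             "layout": [0.60, 0.30, 0.80, 0.50],
             "motions": ["bounces"]},
        ],
    },
}

In such a plan, the scene-level entries would drive per-scene prompting, while the object-level layouts and motion descriptions would feed the retrieval-augmented motion adaptation and the region-based generation.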
Figure 2: Implementation details of region-based diffusion with SR3AI. We extend the vanilla self-attention mechanism to spatial-temporal region-based 3D attention (upper orange part), which aligns different regions with their respective text descriptions via region-specific masks. The region-based character and motion LoRAs (lower yellow and blue parts) are then injected in an interleaved manner into the attention and FFN layers of each transformer block (right part). Note that although we reshape the visual tokens into sequential 2D latent frames for better visualization, they are flattened and concatenated with all conditions during region-based attention.
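At the heart of SR3AI is an attention mask that keeps 3D spatial-temporal attention among visual tokens while restricting text conditioning to matching regions. The PyTorch snippet below is a minimal sketch of how such a mask could be built, assuming flattened visual tokens labeled with a region index and one text prompt per region; it illustrates the masking idea only and is not the actual SR3AI implementation.

import torch

def build_region_attention_mask(region_ids, text_lens):
    """
    region_ids: LongTensor [T*H*W] with the region index of every flattened
                visual token (all frames flattened and concatenated).
    text_lens:  list of text-token counts, one regional prompt per region,
                ordered by region index.
    Returns a bool mask [N, N] where True means token i may attend to token j.
    """
    num_visual = region_ids.numel()
    # Region index of every text token, one prompt per region.
    text_region_ids = torch.cat(
        [torch.full((n,), r, dtype=torch.long) for r, n in enumerate(text_lens)]
    )
    all_ids = torch.cat([region_ids, text_region_ids])    # [N]
    is_visual = torch.zeros(all_ids.numel(), dtype=torch.bool)
    is_visual[:num_visual] = True

    same_region = all_ids[:, None] == all_ids[None, :]    # [N, N]
    both_visual = is_visual[:, None] & is_visual[None, :]

    # Visual tokens attend to all visual tokens across frames (3D spatial-
    # temporal attention); text conditioning is region-specific: a visual
    # token only attends to the text tokens of its own region, and each
    # prompt only interacts with its own region.
    return both_visual | same_region

# Toy usage: two 2x2 latent frames split left/right into two regions,
# conditioned on a 3-token and a 4-token regional prompt.
region_ids = torch.tensor([0, 1, 0, 1,  0, 1, 0, 1])
mask = build_region_attention_mask(region_ids, text_lens=[3, 4])  # [15, 15] bool

Under such a mask, region-specific character and motion LoRAs would only influence the tokens of their own region, which is what would enable the fine-grained object-motion binding described above.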
Best/second-best scores for open-source models are bolded/underlined. Gray indicates closed-source models, and yellow indicates the best score among closed-source models.
@article{zun2024dreamrunner,
  author  = {Zun Wang and Jialu Li and Han Lin and Jaehong Yoon and Mohit Bansal},
  title   = {DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation},
  journal = {arXiv preprint arXiv:2411.16657},
  year    = {2024},
  url     = {https://arxiv.org/abs/2411.16657}
}