Seedance 2.0 Reference Input Guide
Use 1 to 3 reference images more effectively in Seedance 2.0 to lock identity, product geometry, and scene consistency.
Reference-to-video is not just “image-to-video with more files.” It is the mode you use when stability is the assignment.
In the current Seedance 2.0 workflow, the reference path is built around one to three reference images. That is enough for most use cases, as long as each image has a clear role.
When to choose reference-to-video
Use reference-to-video when your biggest risk is drift:
- face identity changes
- the product changes shape
- hands break during interaction
- wardrobe or props mutate
- the scene loses continuity across retries
If you only need to animate a still image, image-to-video is simpler. Use reference-to-video when continuity matters more than simplicity.
Give each reference a job
The cleanest reference workflows usually assign roles like this:
| Reference | Best role |
|---|---|
| Image 1 | main identity or hero object |
| Image 2 | supporting angle, outfit, or product detail |
| Image 3 | optional color, environment, or secondary continuity cue |
Do not upload three images that all fight each other on pose, lighting, and styling. More references only help when they are aligned.
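The role table above can be turned into a small pre-upload check. The following is a minimal sketch, not part of any Seedance tooling: the `Reference` class, the role names (`identity`, `support`, `palette`), and `validate_reference_set` are hypothetical names chosen for illustration. It only checks the count and the role assignment; it cannot judge whether the images visually agree.

```python
from dataclasses import dataclass

# Hypothetical role labels derived from the table above (not Seedance API values).
ROLES = ("identity", "support", "palette")
MAX_REFERENCES = 3


@dataclass(frozen=True)
class Reference:
    path: str  # local file path or URL of the image
    role: str  # one of ROLES


def validate_reference_set(refs):
    """Return a list of problems; an empty list means the set looks usable."""
    problems = []
    if not 1 <= len(refs) <= MAX_REFERENCES:
        problems.append(f"expected 1 to {MAX_REFERENCES} references, got {len(refs)}")
    roles = [r.role for r in refs]
    for role in roles:
        if role not in ROLES:
            problems.append(f"unknown role: {role}")
    if len(set(roles)) != len(roles):
        problems.append("two references share a role; give each image its own job")
    if refs and refs[0].role != "identity":
        problems.append("Image 1 should be the identity or hero-object anchor")
    return problems
```

Running the check before each generation makes it harder to slip in a second "hero" image that competes with the first.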
What makes a strong reference set
Your references should agree on the things you want preserved:
- same person or same product
- compatible lighting logic
- compatible styling
- similar quality level
Your references should not disagree on:
- age or face shape
- product proportions
- costume color
- camera distance
Conflicting references make the model average them, which is where drift starts.
The best prompt structure for reference mode
In reference mode, the order changes slightly:
- state the stability rule
- define the action
- define one camera move
- define style
- define constraints
Example:
@Image1 creator identity remains consistent, holds the skincare bottle near the face, subtle push-in, soft daylight beauty review setup, no face drift no finger artifacts no bottle shape change
The key difference is that consistency comes before atmosphere.
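The five-part ordering can be sketched as a small prompt builder. This is an illustrative helper, assuming the prompt is sent as one comma-separated string; `build_reference_prompt` and its parameter names are hypothetical and not part of Seedance itself.

```python
def build_reference_prompt(stability, action, camera, style, constraints):
    """Join the five parts in order: stability rule first, constraints last."""
    names = ("stability", "action", "camera", "style", "constraints")
    parts = [stability, action, camera, style, " ".join(constraints)]
    # Refuse to build a prompt with an empty slot, since each part has a job.
    missing = [name for name, value in zip(names, parts) if not value.strip()]
    if missing:
        raise ValueError(f"missing prompt parts: {', '.join(missing)}")
    return ", ".join(part.strip() for part in parts)


prompt = build_reference_prompt(
    stability="@Image1 creator identity remains consistent",
    action="holds the skincare bottle near the face",
    camera="subtle push-in",
    style="soft daylight beauty review setup",
    constraints=["no face drift", "no finger artifacts", "no bottle shape change"],
)
print(prompt)
```

With these inputs the builder reproduces the example prompt. Keeping the parts as separate arguments also makes it easy to change one of them between retries while leaving the stability rule untouched.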
When reference mode is the better answer
Reference-to-video is usually the better choice when:
- a creator must remain recognizable
- a product demo depends on accurate shape
- hands are touching the hero object
- multiple retries need to stay on-brand
- you are building a sequence from several short clips
This is especially relevant for:
- UGC ads
- beauty demos
- packaging shots
- fashion accessories
- creator explainers
Common reference mistakes
Using references to solve a prompt problem
If the scene is vague, references will not save it. You still need:
- one clear action
- one camera instruction
- one stable visual hierarchy
Using too many visual ideas in one shot
Reference mode is for protecting continuity, not for asking the model to do everything at once. Keep the shot narrow:
- one action
- one hero subject
- one focal intention
Forgetting to protect hands
If hands are on screen, add explicit hand constraints (for example, "no finger artifacts") to the negative prompt. References alone do not improve hand stability.
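One way to make this habit automatic is to append hand constraints whenever a shot includes hands. The sketch below is an assumption about how you might organize your own prompt tooling: `with_hand_constraints` is a hypothetical helper, and the default term is simply the hand constraint used in the example prompt earlier in this guide.

```python
def with_hand_constraints(constraints, hands_on_screen,
                          hand_terms=("no finger artifacts",)):
    """Return a new constraint list that includes hand terms when hands are visible."""
    result = list(constraints)  # copy so the caller's list is not modified
    if hands_on_screen:
        for term in hand_terms:
            if term not in result:  # avoid repeating a term already present
                result.append(term)
    return result
```

For example, `with_hand_constraints(["no face drift"], hands_on_screen=True)` returns the original constraint followed by the hand term, while a shot without hands is returned unchanged.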
A simple reference workflow
For creators
- Image 1: clear face and upper-body anchor
- Image 2: product hold or outfit support
- Prompt: one speaking or holding action only
For products
- Image 1: clean hero packshot
- Image 2: alternate angle or material detail
- Prompt: one reveal or one hold, not a full mini-commercial
For characters
- Image 1: identity anchor
- Image 2: costume or silhouette support
- Image 3: optional environment palette if it does not conflict
Failure modes and fixes
| Problem | Likely cause | First fix |
|---|---|---|
| Face changes across frames | references conflict or action is too big | reduce pose change and use one clearer anchor |
| Product shape drifts | too much camera motion | simplify the move and name geometry preservation |
| Hands still break | the hand action is too expressive | use a simpler gesture and strengthen hand constraints |
| Output looks stiff | continuity rules are too dominant | keep the lock rule, then add one subtle motion layer |
When to switch back out of reference mode
If the output is stable but too stiff, continuity may no longer be the problem. At that point:
- switch to image-to-video if the first frame matters most
- switch to text-to-video if you want broader creative exploration
Reference mode is strongest when protecting identity is the main objective.
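The mode choices described in this guide can be summarized as a small decision helper. This is a sketch of the guide's reasoning only: `pick_mode` and its flags are hypothetical, and the returned strings are descriptive labels, not API values.

```python
def pick_mode(drift_is_main_risk, first_frame_matters_most=False):
    """Pick a generation mode following the guidance in this guide."""
    if drift_is_main_risk:
        # Identity, product geometry, or scene continuity must be protected.
        return "reference-to-video"
    if first_frame_matters_most:
        # Animating one still image is the simpler path.
        return "image-to-video"
    # Otherwise favor broader creative exploration.
    return "text-to-video"
```

For instance, `pick_mode(drift_is_main_risk=False, first_frame_matters_most=True)` selects image-to-video, which matches the advice for output that is stable but too stiff.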