ByteDance releases Lance, a 3B-active-parameter unified model handling text-to-image, text-to-video, editing, and understanding in a single framework.
Key Takeaways
Trained from scratch on 128 A100 GPUs: only the ViT and VAE encoders are pretrained, while the transformer backbone is entirely new.
Scores 90 overall on GenEval (matching TUNA-7B and beating the 12B-parameter FLUX.1-dev) and 85.11 on VBench, topping all listed unified and generation-only models.
Requires 40 GB or more of VRAM for inference; supports 480p video of up to 121 frames and 768x768 images through a single unified CLI.
Six task modes (t2i, t2v, image_edit, video_edit, x2t_image, x2t_video) share one model checkpoint, reducing deployment overhead for multi-task pipelines.
A GEdit-Bench average of 7.30 places it above BAGEL-7B and InternVL-U without chain-of-thought prompting.
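For teams wiring the model into a multi-task pipeline, the limits in the takeaways above can be checked before a job is submitted. The following is a minimal sketch in Python, not Lance's actual interface: the `Request` class and `validate` function are hypothetical names, and the mapping of limits to tasks (frame cap for `t2v` and `video_edit`, image-size cap for `t2i` and `image_edit`) is an assumption made for illustration. Only the six task-mode names and the numeric limits come from the article.

```python
from dataclasses import dataclass

# Task modes as named in the takeaways above.
TASK_MODES = ("t2i", "t2v", "image_edit", "video_edit", "x2t_image", "x2t_video")

# Limits quoted in the takeaways above.
MAX_VIDEO_FRAMES = 121  # upper bound stated for 480p video
MAX_IMAGE_SIDE = 768    # stated image resolution is 768x768
MIN_VRAM_GB = 40        # stated inference requirement

# Assumption: which tasks each limit applies to is not spelled out in the article.
VIDEO_OUTPUT_TASKS = {"t2v", "video_edit"}
IMAGE_OUTPUT_TASKS = {"t2i", "image_edit"}


@dataclass
class Request:
    """Hypothetical job description; not part of Lance itself."""
    task: str
    width: int = 768
    height: int = 768
    frames: int = 1


def validate(req: Request, vram_gb: float) -> list[str]:
    """Return a list of human-readable problems; an empty list means the request fits."""
    problems = []
    if req.task not in TASK_MODES:
        problems.append(f"unknown task mode: {req.task}")
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"{vram_gb} GB VRAM is below the stated {MIN_VRAM_GB} GB minimum")
    if req.task in VIDEO_OUTPUT_TASKS and req.frames > MAX_VIDEO_FRAMES:
        problems.append(f"{req.frames} frames exceeds the stated {MAX_VIDEO_FRAMES}-frame limit")
    if req.task in IMAGE_OUTPUT_TASKS and max(req.width, req.height) > MAX_IMAGE_SIDE:
        problems.append(f"{req.width}x{req.height} exceeds {MAX_IMAGE_SIDE}x{MAX_IMAGE_SIDE}")
    return problems


if __name__ == "__main__":
    print(validate(Request(task="t2v", frames=200), vram_gb=48))
```

Because all six modes share one checkpoint, a check like this can sit in front of a single worker process rather than being duplicated per model.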