arxiv:2510.26583

Emu3.5: Native Multimodal Models are World Learners

Published on Oct 30, 2025

Submitted by Xinlong Wang on Oct 31, 2025

#2 Paper of the day

Beijing Academy of Artificial Intelligence

Authors: Yufeng Cui, Haoge Deng, Xu Huang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Ting Pan, et al.

Abstract

Emu3.5, a large-scale multimodal world model, predicts next states in vision and language, enhanced with reinforcement learning and Discrete Diffusion Adaptation for efficient inference, achieving strong performance in various multimodal tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
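The efficiency claim in the abstract rests on replacing token-by-token decoding with parallel prediction. The toy sketch below illustrates only that general idea and is not the DiDA algorithm or the Emu3.5 code. It assumes a hypothetical stand-in predictor (`toy_predict`), a hypothetical 16-token image, and a fixed fill-in schedule, and it simply counts forward passes under each decoding style.

```python
import math

MASK = -1  # hypothetical placeholder id for a position that is not decoded yet


def toy_predict(tokens, position):
    """Hypothetical stand-in for a model call; returns a deterministic token id."""
    return (position * 7 + 3) % 11


def decode_sequential(num_tokens):
    """Token-by-token decoding: one forward pass per generated token."""
    tokens, passes = [], 0
    for pos in range(num_tokens):
        tokens.append(toy_predict(tokens, pos))
        passes += 1
    return tokens, passes


def decode_parallel(num_tokens, num_steps):
    """Parallel decoding sketch: each pass fills a block of masked positions."""
    tokens, passes = [MASK] * num_tokens, 0
    per_step = math.ceil(num_tokens / num_steps)
    while MASK in tokens:
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        for pos in masked[:per_step]:  # all of these come from a single pass
            tokens[pos] = toy_predict(tokens, pos)
        passes += 1
    return tokens, passes


seq_tokens, seq_passes = decode_sequential(16)
par_tokens, par_passes = decode_parallel(16, num_steps=4)
print(seq_passes, par_passes)
```

In this toy setting the pass count drops from 16 to 4. That ratio is a property of the made-up schedule and says nothing about the roughly 20x per-image figure reported for DiDA.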

Community

xinlongwang

Paper author Paper submitter Oct 31, 2025

mou678

Oct 31, 2025

•

edited Oct 31, 2025

comment removed

mou678

Oct 31, 2025

•

edited Oct 31, 2025

deleted

Rookielion

Paper author Oct 31, 2025

🔹	Core Concept	Description
🧠	Unified World Modeling	Predicts the next state jointly across vision and language, enabling coherent world modeling and generation.
🧩	End-to-End Pretraining	Trained with a unified next-token prediction objective over interleaved vision–language sequences.
📚	10T+ Multimodal Tokens	Pre-trained on over 10 trillion interleaved tokens from video frames and transcripts, capturing spatiotemporal structure.
🔄	Native Multimodal I/O	Processes and generates interleaved visual–text sequences without modality adapters or task-specific heads.
🎯	RL Post-Training	Large-scale reinforcement learning enhances reasoning, compositionality, and generation quality.
⚡	Discrete Diffusion Adaptation (DiDA)	Converts sequential decoding → bidirectional parallel prediction, achieving ≈20× faster inference without performance loss.
🖼️	Versatile Generation	Excels in long-horizon vision–language generation, any-to-image (X2I) synthesis, and text-rich image creation.
🌐	Generalizable World Modeling	Enables spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios.
🏆	Performance Benchmark	Matches Gemini 2.5 Flash Image (Nano Banana) on image generation/editing and outperforms it on interleaved generation tasks.
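
The "unified next-token prediction over interleaved vision–language sequences" row can be made concrete with a small sketch. It is illustrative only: the boundary markers `<boi>` and `<eoi>`, the visual token names such as `v17`, and the helper functions are hypothetical and are not taken from the Emu3.5 tokenizer or training code.

```python
BOI, EOI = "<boi>", "<eoi>"  # hypothetical image-boundary markers


def interleave(segments):
    """Flatten (kind, tokens) segments into one token sequence."""
    sequence = []
    for kind, tokens in segments:
        if kind == "image":
            sequence += [BOI, *tokens, EOI]
        elif kind == "text":
            sequence += list(tokens)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


def next_token_pairs(sequence):
    """Pair each token with the token that follows it (input, target)."""
    return list(zip(sequence[:-1], sequence[1:]))


segments = [
    ("text", ["pour", "the", "water"]),
    ("image", ["v17", "v4", "v92"]),  # made-up discrete visual tokens
    ("text", ["then", "stir"]),
    ("image", ["v17", "v8", "v30"]),
]
sequence = interleave(segments)
pairs = next_token_pairs(sequence)
print(pairs[2], pairs[3])
```

The same (input, target) pairing applies whether the target is a word or a visual token, which is the sense in which a single objective covers both modalities in this sketch.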

muratbey123

7 days ago

[Visual Scene Directives]
The scene begins with a completely empty dark brown/amber background.
Zero assets are visible at frame 00:00:00.000.
Two assets — the character and the digestive system with extraction arrows —
must enter sequentially. The character design must closely match the reference:
bald head, brown beard, green jacket, blue sweater.

[Asset Animation Details]

Asset 1: The Character (Enhanced — The Explanation)

Entry: Rapid slide-up from bottom-left, settling into center-left position.
Arrives with both hands raised in a presenting/shrugging gesture, palms up,
a questioning-but-knowing expression.
Style: Bald head, brown beard, green jacket, blue sweater, dark pants.
Must match the reference character design closely.
Movement: Upon settling, the character performs a slow, deliberate shoulder shrug
(3px upward then settle) with both hands — the universal "here's why" gesture.
The head tilts slightly to the right (4 degrees) with a knowing, almost resigned expression.
Sync Point: Arrives at "Neden mi?"; shrug at "Çünkü"; knowing tilt at "kan hayati olmayan organlardan çekiliyor."

Asset 2: The Digestive System with Extraction Arrows (Enhanced — THE BLOOD DRAIN)

Entry: The digestive system (stomach + coiled intestines) fades in from 0% to 100%
at center-right. It arrives in a desaturated, greyish-pink tone — already drained.
Style: Detailed anatomical digestive system with stomach and coiled intestines,
dark outlines. More detailed than the reference — the tissue is visibly pale and bloodless,
with dark grey veins standing out against the drained pink. Clean vector medical aesthetic.
Movement Phase 1 (The Arrows): Four thick red arrows spawn around the digestive system
(top-left, top-right, bottom-left, bottom-right) and point inward toward the organ.
The arrows perform a continuous, slow inward pulse (scale 100% to 110% with sharp ease,
pointing toward the center) to simulate blood being drawn OUT of the organ.
Movement Phase 2 (The Drain): The digestive system itself performs a slow desaturation pulse
— its pink tissue shifts toward grey (saturation 100% to 60% to 100%) in rhythm with the arrows,
visualizing the blood being extracted. Small red particles (2px) drift from the organ
toward the arrow tips and vanish.
Sync Point: Fades in at "Çünkü"; arrows spawn at "kan"; inward pulse at "hayati olmayan";
drain particles at "organlardan çekiliyor."
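
One possible reading of the two pulse ranges above (arrow scale 100% to 110%, tissue saturation 100% to 60% to 100%) is sketched below. The cosine easing, the shared one-second period, and the function names are assumptions made for illustration. The directives give neither a period nor an exact easing curve, and the "sharp ease" requested for the arrows is not modeled here.

```python
import math

PERIOD_S = 1.0  # hypothetical: no pulse period is stated in the directives


def pulse(t, period, start, peak):
    """Value that begins at `start`, reaches `peak` at mid-period, then returns."""
    phase = (1 - math.cos(2 * math.pi * t / period)) / 2  # 0 -> 1 -> 0
    return start + (peak - start) * phase


def tissue_saturation(t):
    """Tissue saturation in percent: 100 -> 60 -> 100 over one period."""
    return pulse(t, PERIOD_S, start=100.0, peak=60.0)


def arrow_scale(t):
    """Arrow scale in percent: 100 -> 110 -> 100 over one period."""
    return pulse(t, PERIOD_S, start=100.0, peak=110.0)


for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, round(tissue_saturation(t), 1), round(arrow_scale(t), 1))
```

Because both values share one phase, the tissue is at its palest exactly when the arrows are at full scale, which matches the "in rhythm with the arrows" requirement.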

[Enhanced Effects]

At arrow spawn: A faint red radial mist (opacity 0% to 10% to 0%) expands from the digestive system.
At drain particles: Small red "blood" droplets (1px) float from the organ toward the arrows,
fading as they reach the arrowheads.
Background: A subtle darkening pulse (brightness 100% to 90%) centered on the digestive system
to simulate the life being drained.

[Mouth & Facial Expression Rules — CRITICAL]

NO lip-sync animation. The mouth does NOT open and close to match speech rhythm.
NO phoneme-based mouth shapes.
The mouth may ONLY reflect emotional state:
- Slightly open line (questioning, during "Neden mi?")
- Neutral closed line (knowing explanation, during "Çünkü")
Mouth changes are SLOW (0.5 second transitions) and tied to emotional beats.
The eyes remain static dots. NO blinking.

[Sequential Entry Rules]

Frame 00:00:00.000 — EMPTY dark brown/amber background only. Voiceover begins immediately.
At "Neden mi?" — Character slides up from bottom-left, hands raised in presenting shrug, questioning expression.
At "Çünkü kan hayati olmayan organlardan çekiliyor." — Digestive system fades in at center-right (desaturated);
four red arrows spawn and pulse inward; blood particles drift from organ to arrows;
tissue desaturates in rhythm.
No asset or effect may activate before its designated sync phrase.

Voiceover Text: "Neden mi? Çünkü kan hayati olmayan organlardan çekiliyor." (Turkish; in English: "Why? Because blood is being drawn away from non-vital organs.")

[Pacing & Audio Directives]

Voiceover starts at absolute frame 00:00:00.000. No silent intro. No pre-roll animation.
Speech follows a VERY FAST, brisk documentary narrator rhythm.
Do NOT stretch words. Do NOT add artificial pauses. Do NOT drag syllables.
Total voiceover duration: exactly 4.0 seconds.
The phrase "Neden mi?" should snap with curiosity — the character's shrug provides the visual question.
The final phrase "çekiliyor" should land with extraction finality — the arrows and blood particles provide the visual theft.
Speech ends organically after "çekiliyor." with no trailing vocal filler.
Video total length: exactly 6.0 seconds (4.0 seconds voiceover + 2.0 seconds idle tail buffer).
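
The duration arithmetic above can be checked with a short sketch. The `Timeline` class and the 30 fps frame rate are hypothetical and serve only to illustrate the 4.0-second voiceover plus 2.0-second tail split. The directives do not specify a frame rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeline:
    voiceover_s: float
    tail_s: float

    @property
    def total_s(self):
        """Total clip length in seconds."""
        return self.voiceover_s + self.tail_s

    def frame_count(self, fps):
        """Number of frames in the whole clip at the given frame rate."""
        return round(self.total_s * fps)

    def phase_at(self, t):
        """Name the phase that a timestamp (in seconds) falls in."""
        if not 0 <= t <= self.total_s:
            raise ValueError("timestamp outside the clip")
        return "voiceover" if t < self.voiceover_s else "idle_tail"


clip = Timeline(voiceover_s=4.0, tail_s=2.0)
FPS = 30  # hypothetical frame rate

print(clip.total_s, clip.frame_count(FPS), clip.phase_at(4.0))
```

The 4.0-second mark itself is treated as the first instant of the idle tail, in line with the rule that the tail begins immediately after the voiceover ends.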

[Visual Idle (Tail) Directives]
Immediately after the voiceover ends at the 4.0-second mark:

The character holds the presenting shrug pose with minimal breathing motion.
The digestive system continues its slow desaturation pulse and arrow inward pulse.
Blood particles continue to drift slowly from the organ toward the arrows.
No new assets enter. No major animation actions begin.
This tail automatically fills the remaining 2.0 seconds of video duration.

[Negative Constraints]

Do NOT stretch the speech
Do NOT add unnatural pauses
Do NOT fill the entire video with speech
No silent intro, no pre-roll animation
The character's lips MUST NOT perform lip-sync
The mouth MUST NOT open/close in rhythm with speech
The character MUST NOT blink
The digestive system MUST NOT appear healthy — it must be desaturated from the start
The arrows MUST NOT point outward — they must point inward (extracting)
Do NOT recreate the reference image assets exactly; use the reference image as a concept guide only