First impressions
I just got access to Gemini Diffusion. The speed is mind-blowing.
From what I can tell, the model diffuses blocks of text sequentially—about two paragraphs per block.
Each finished block is fed back in as context for the next one.
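If that is how it works, the decoding loop might look something like the sketch below. The `diffuse_block` callable, the empty-block stop condition, and the block budget are all my assumptions, not anything published about Gemini Diffusion:

```python
from typing import Callable

def generate_blockwise(
    prompt: str,
    diffuse_block: Callable[[str], str],  # hypothetical denoiser: context -> one finished block
    max_blocks: int = 8,
) -> str:
    """Generate text block by block: each block is denoised as a whole,
    then appended to the context that conditions the next block."""
    context = prompt
    blocks = []
    for _ in range(max_blocks):
        block = diffuse_block(context)  # refine one ~2-paragraph block in parallel
        if not block:                   # assume an empty block signals completion
            break
        blocks.append(block)
        context += "\n\n" + block       # finished block feeds back in as context
    return "\n\n".join(blocks)
```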
I tried it on a handful of tasks:
- Planning an event
- Coding a small app
- Solving a LeetCode-style interview question
- Summarizing a long document
- Answering questions over provided context (a car rental agreement)
- Writing a long essay
- Generating a Markdown table
Formatting can be a bit messy, and you sometimes get repetition between neighboring paragraphs.
It also struggles to surface some of the relevant nuggets of information when the user prompt is huge (e.g. the car rental agreement).
Still, overall performance is quite impressive.
Riddle and Chain-of-Thought
I also tried to get the model to solve a riddle.
It got it wrong.
My next thought was to try Chain-of-Thought (CoT) prompting.
Two things happened:
- The `<think>` tags were removed from the prompt.
- The model kept generating until it hit roughly 2^14 (16,384) tokens, then displayed nothing.
Questions and thoughts
- Image generation is edging towards auto-regressive methods (e.g. 4o Image Generation) just as diffusion-based text generation is becoming practical.
- How will inference-time reasoning work? And is the above CoT failure a sign it is already doing some reasoning behind the scenes?
In a diffusion world, I imagine you generate thinking text first, potentially with a couple of refinement rounds, then feed it back in to diffuse the first and subsequent response blocks (see the first sketch after this list).
- Will users mind that it does not stream the text? You could always fake streaming by chunking the output (see the second sketch after this list).
- Could we mix auto-regressive and diffusion approaches for text the same way we already blend them for images?
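To make that two-stage thinking idea concrete, here is a rough sketch. Everything in it (`diffuse_block`, the `refine` step, the number of rounds, the `<think>` framing) is my own speculation about how diffusion-style CoT could work, not how Gemini Diffusion actually does it:

```python
from typing import Callable

def answer_with_thinking(
    prompt: str,
    diffuse_block: Callable[[str], str],  # hypothetical denoiser: context -> one text block
    refine: Callable[[str, str], str],    # hypothetical: (prompt, draft) -> improved draft
    refinement_rounds: int = 2,
) -> str:
    """Speculative diffusion-era CoT: diffuse a hidden thinking block,
    refine it whole a couple of times, then condition the visible
    answer on it."""
    thoughts = diffuse_block(prompt + "\n\n<think>\n")
    for _ in range(refinement_rounds):
        thoughts = refine(prompt, thoughts)  # whole-draft refinement, not token-by-token
    # Diffuse the first response block with the hidden thoughts as context;
    # subsequent blocks would follow the same block-wise loop as above.
    return diffuse_block(f"{prompt}\n\n<think>\n{thoughts}\n</think>\n\n")
```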
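And on the streaming question, faking it is trivial once a block is finished. A minimal sketch, using nothing beyond the standard library:

```python
import time
from typing import Iterator

def fake_stream(text: str, chunk_size: int = 40, delay: float = 0.02) -> Iterator[str]:
    """Dole out an already-complete diffusion output in small chunks so the
    UI feels like token-by-token streaming."""
    for i in range(0, len(text), chunk_size):
        yield text[i:i + chunk_size]
        time.sleep(delay)  # pacing between chunks; tune to taste

# Usage: the full block exists before the user sees the first chunk.
for chunk in fake_stream("A finished block of diffused text ..."):
    print(chunk, end="", flush=True)
```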