Introspective Diffusion Language Models

Yifan Yu*, Yuqing Jian*, Junxiong Wang, Zhongzhu Zhou, Donglin Zhuang, Xinyu Fang, Sri Yanamandra, Xiaoxia Wu, Qingyang Wu, Shuaiwen Leon Song,
Tri Dao, Ben Athiwaratkun, James Zou, Fan Lai, Chenfeng Xu
* Equal contribution    † Equal advising
Together AI  •  University of Illinois Urbana-Champaign  •  The University of Sydney
Princeton University  •  Stanford University  •  The University of Texas at Austin
[Figure 1: I-DLM teaser, showing introspective consistency and the quality-throughput frontier]
(a) Standard DLMs generate tokens whose distributions diverge from the model's own next-step predictions; I-DLM trains generation and introspection to agree. (b) Quality vs. throughput on MATH-500: I-DLM-8B matches Qwen3-8B (thinking) AR performance while achieving 3.1x higher throughput.

Highlights:
- AIME-24 (I-DLM-8B): 72.5 vs. 43.3 for LLaDA-2.1-mini
- LiveCodeBench-v6 (I-DLM-8B): 45.1 vs. 30.4 for LLaDA-2.1-mini
- 2.9–4.1x throughput gain over LLaDA-2.1-mini at C=64
- Lossless: bit-for-bit identical outputs to the base AR model

Demo

Demo video coming soon

Abstract

Diffusion language models (DLMs) offer a compelling promise: parallel token generation could break the sequential bottleneck of autoregressive (AR) decoding. Yet in practice, DLMs consistently lag behind AR models in quality.

We argue that this gap stems from a fundamental failure of introspective consistency: AR models agree with what they generate, whereas DLMs often do not. We formalize this via the introspective acceptance rate, which quantifies whether a model internally accepts its previously generated tokens. Through this lens, we uncover a key structural advantage of AR models: causal masking combined with logit shifting implicitly enforces introspective consistency during training.
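To make the idea concrete, here is a minimal sketch of one plausible estimator of such an acceptance rate, in the style of speculative sampling: if q is the distribution each token was actually sampled from and p is the model's next-step prediction for the same position, then min(1, p/q), averaged over generated tokens, measures how often the model would internally accept its own output. The exact definition in the paper may differ; the function name and interface below are illustrative assumptions.

```python
def introspective_acceptance_rate(p_probs, q_probs, tokens):
    """Sketch of an introspective acceptance rate (assumed form, not the paper's exact definition).

    p_probs: per-position next-step prediction distributions, [T][V]
    q_probs: per-position generation distributions the tokens were sampled from, [T][V]
    tokens:  the T generated token ids
    Returns the mean speculative-style acceptance probability min(1, p/q).
    """
    ratios = [
        min(1.0, p_probs[i][t] / q_probs[i][t])
        for i, t in enumerate(tokens)
    ]
    return sum(ratios) / len(ratios)
```

When generation and introspection agree exactly (p = q at every sampled token), the rate is 1.0; the larger the divergence, the lower the rate, matching the gap the paper reports between prior DLMs and I-DLM.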

Motivated by this insight, we introduce the Introspective Diffusion Language Model (I-DLM), a new paradigm that preserves the introspective consistency of AR training while retaining diffusion-style parallelism. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. This yields a new quality–efficiency frontier unavailable to either AR or prior diffusion models.

Empirically, I-DLM-8B is the first DLM to match the quality of its same-scale AR counterpart while surpassing all prior DLMs in both quality and practical serving efficiency across 15 benchmarks. It attains 72.5 on AIME-24 and 45.1 on LiveCodeBench-v6, outperforming LLaDA-2.1-mini (16B) by more than +29 and +14 points respectively — despite using half the parameters. At high concurrency, I-DLM delivers 2.9–4.1x higher throughput than prior DLMs. With gated LoRA, ISD enables lossless (bit-for-bit identical) acceleration that outperforms speculative decoding baselines at scale.

Why Introspective Consistency?

Key Insight: The success of AR models is not solely due to next-token prediction, but rather due to a deeper structural property — AR training unifies generation and introspection in one forward pass. Existing DLMs miss this property: they learn to denoise but not to introspect.

We identify three fundamental bottlenecks in current DLMs:

[Figure 2, three panels: introspective acceptance rate; compute overhead vs. TPF; batching efficiency]

(1) Low introspective consistency. DLMs cannot reliably agree with their own generations: SDAR achieves an acceptance rate of only 0.699, versus 0.984 for I-DLM.
(2) Compute inefficiency. Parallel decoding costs far more FLOPs per token than it saves: at TPF ~2.5, TiDAR incurs ~7.8x compute overhead, versus ~2.5x for I-DLM.
(3) Infrastructure incompatibility. Multi-step denoising is poorly aligned with AR serving stacks: SDAR's throughput barely scales with TPF (slope = 84, versus 549 for I-DLM).

The I-DLM Method

Introspective-Consistency Training

Convert pretrained AR models into introspective diffusion models via causal attention, logit shift, and an all-masked objective. Every position — masked or clean — is trained under the same directional regime.
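A minimal sketch of the shifted training loss this describes, assuming the model runs a causal forward pass over a fully masked input and the logits at position t supervise the clean token at position t+1 (the function name and shapes are illustrative, not the paper's implementation):

```python
import math

def shifted_masked_ce(logits, tokens):
    """Cross-entropy with AR-style logit shift (illustrative sketch).

    logits: [T][V] rows from a causal pass over a fully masked sequence
    tokens: [T] ground-truth token ids
    Row t predicts token t+1, so the last row and first token are unused,
    mirroring how causal masking + shifting works in AR training.
    """
    total = 0.0
    pairs = list(zip(logits[:-1], tokens[1:]))
    for row, target in pairs:
        # numerically stable log-sum-exp for the softmax normalizer
        m = max(row)
        lse = m + math.log(sum(math.exp(z - m) for z in row))
        total += lse - row[target]
    return total / len(pairs)
```

Because every position is supervised under the same shifted, causal regime regardless of masking, the model's predictions at clean positions are trained to match the very distributions it generates from, which is the introspective-consistency property the section describes.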

Introspective Strided Decoding (ISD)

Generate N tokens per forward pass while verifying prior tokens against causal anchor distributions via the p/q acceptance criterion.
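One plausible reading of the p/q acceptance criterion, in the style of speculative sampling (a sketch under that assumption, not the paper's exact algorithm): each drafted token is kept with probability min(1, p/q), where q is the distribution it was drafted from and p is the causal anchor distribution, and the first rejection truncates the stride.

```python
import random

def verify_tokens(draft_tokens, q_probs, p_probs, seed=0):
    """Speculative-style verification sketch for a stride of drafted tokens.

    draft_tokens: the N token ids proposed in one forward pass
    q_probs:      [N][V] draft distributions used at generation time
    p_probs:      [N][V] causal anchor distributions from the verifying pass
    Returns the number of accepted tokens; everything after the first
    rejection is discarded. (In full speculative sampling the rejected
    position would then be resampled from the residual distribution,
    yielding one quality-guaranteed token per pass.)
    """
    rng = random.Random(seed)
    accepted = 0
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i][tok], q_probs[i][tok]
        ratio = min(1.0, p / q) if q > 0 else 0.0
        if rng.random() < ratio:
            accepted += 1
        else:
            break
    return accepted
```

Under this criterion, a model with high introspective consistency (p close to q) accepts nearly the whole stride, which is how a high acceptance rate translates directly into tokens per forward pass.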

AR-Compatible Serving

Strict causal attention enables direct integration into existing AR serving stacks (SGLang). No custom infrastructure needed — paged KV cache, continuous batching, and CUDA graphs work unmodified.

Comparison of decoding paradigms
Comparison of decoding paradigms. I-DLM uses strict causal attention with adaptive stride and is a drop-in replacement within AR serving infrastructure. ISD produces a quality-guaranteed token together with draft tokens via introspective strided decoding.

Experimental Results

I-DLM is the first DLM to match the quality of its same-scale AR counterpart while surpassing all prior DLMs in both quality and practical serving efficiency across 15 benchmarks. I-DLM-8B outperforms LLaDA-2.1-mini (16B) by +29.2 on AIME-24 and +14.7 on LiveCodeBench-v6 despite using half the parameters.

Throughput-latency tradeoff
Throughput–latency tradeoff across batch sizes. I-DLM consistently outperforms prior DLMs at moderate to high concurrency, delivering 2.9–4.1x higher throughput at C=64. It also outperforms speculative decoding methods (EAGLE-3, DFlash) starting from C≥16.

End-to-End Quality (Table 2)

Accuracy (%) on 15 benchmarks. I-DLM uses ISD (N=4, sampling). In the paper's table, blue marks the best non-AR model under 30B and bold the best non-AR model under 100B.

Benchmark   LLaDA-2.1   LLaDA-2.0    LLaDA-2.1    SDAR   SDAR   I-DLM  Qwen3  I-DLM  Qwen3
            mini (16B)  flash (100B) flash (100B) (8B)   (30B)  (8B)   (8B)   (32B)  (32B)

Knowledge & Reasoning
ARC-C       90.2        ---          ---          91.9   93.2   95.5   95.8   96.8   97.2
MMLU        74.5        ---          ---          78.6   82.8   82.4   83.5   86.8   87.2
MMLU-Pro    64.8        74.8         76.6         56.9   61.5   73.1   75.1   79.7   80.1
GPQA-D      46.0        ---          ---          40.2   36.7   59.1   59.0   62.1   64.1
GPQA        53.3        62.3         67.3         ---    ---    54.5   56.0   58.7   65.0

Math
GSM8K       89.0        ---          ---          91.7   91.4   96.0   96.0   94.9   94.7
MATH-500    85.0        ---          ---          78.6   77.8   95.8   95.8   97.6   97.8
MathBench   84.2        ---          ---          76.9   79.3   89.1   93.1   95.6   95.5
AIME-24     43.3        ---          ---          10.0   16.7   72.5   76.7   83.3   76.7
AIME-25     43.3        60.0         63.3         10.0   10.8   61.0   60.0   80.0   80.0

Code
HumanEval   86.0        ---          ---          78.7   87.2   92.7   95.7   96.3   96.3
MBPP        82.1        ---          ---          72.0   71.6   92.8   93.4   94.6   95.7
LCB-v6      30.4        42.5         45.4         16.6   21.7   45.1   50.3   57.1   58.3

Instruction Following
IFEval      83.2        82.6         83.6         61.4   60.6   84.7   84.7   84.7   84.5

Extended Comparison (Table 3)

Benchmarks commonly reported across diffusion LLM methods. "---" = not reported.

Method                GSM8K   MMLU   HumanEval   MBPP   IFEval
Qwen3-8B (AR)         96.0    83.5   95.7        93.4   84.7
NBDiff (7B)           91.0    82.9   89.0        87.6   60.8
Jacobi Forcing (7B)   91.4    ---    83.5        70.4   ---
WeDLM (8B)            90.2    75.5   75.0        67.0   ---
LightningRL (8B)      90.3    ---    72.6        58.3   ---
TiDAR (8B)            80.4    76.6   57.9        65.4   ---
DREAM (7B)            81.0    70.6   57.9        58.8   62.5
Fast-dLLM (7B)        78.5    ---    43.3        28.2   ---
Mercury Coder Small   ---     ---    90.0        76.6   ---
Gemini Diffusion      ---     ---    89.6        76.0   ---
Ours (8B)             96.0    82.4   92.7        92.8   84.7

Citation

@article{yu2025introspective,
  title={Introspective Diffusion Language Models},
  author={Yu, Yifan and Jian, Yuqing and Wang, Junxiong and Zhou, Zhongzhu
          and Zhuang, Donglin and Fang, Xinyu and Yanamandra, Sri
          and Wu, Xiaoxia and Wu, Qingyang and Song, Shuaiwen Leon
          and Dao, Tri and Athiwaratkun, Ben and Zou, James
          and Lai, Fan and Xu, Chenfeng},
  journal={arXiv preprint},
  year={2025}
}