Demo video coming soon
Diffusion language models (DLMs) offer a compelling promise: parallel token generation could break the sequential bottleneck of autoregressive (AR) decoding. Yet in practice, DLMs consistently lag behind AR models in quality.
We argue that this gap stems from a fundamental failure of introspective consistency: AR models agree with what they generate, whereas DLMs often do not. We formalize this via the introspective acceptance rate, which quantifies whether a model internally accepts its previously generated tokens. Through this lens, we uncover a key structural advantage of AR models: causal masking combined with logit shifting implicitly enforces introspective consistency during training.
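One way to make the introspective acceptance rate concrete is the following formalization (this is our own hedged notation, not the paper's verbatim definition; $p_\theta$, $q_\theta$, and $c_t$ are assumptions): re-score each self-generated token under the model's causal distribution and measure how often the standard speculative acceptance test passes,

$$
\alpha \;=\; \mathbb{E}_{x \sim q_\theta}\!\left[\frac{1}{T}\sum_{t=1}^{T}\min\!\left(1,\; \frac{p_\theta(x_t \mid x_{<t})}{q_\theta(x_t \mid c_t)}\right)\right],
$$

where $q_\theta(\cdot \mid c_t)$ is the distribution token $x_t$ was actually sampled from at generation time (with whatever context $c_t$ the decoder used) and $p_\theta(\cdot \mid x_{<t})$ is the same model's causal re-scoring distribution. For an AR model, $c_t = x_{<t}$, so the two distributions coincide and $\alpha = 1$ by construction; a DLM's parallel context makes them diverge, driving $\alpha$ below 1.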
Motivated by this insight, we introduce the Introspective Diffusion Language Model (I-DLM), a new paradigm that preserves the introspective consistency of AR training while retaining diffusion-style parallelism. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. This yields a new quality–efficiency frontier unavailable to either AR or prior diffusion models.
Empirically, I-DLM-8B is the first DLM to match the quality of its same-scale AR counterpart while surpassing all prior DLMs in both quality and practical serving efficiency across 15 benchmarks. It attains 72.5 on AIME-24 and 45.1 on LiveCodeBench-v6, outperforming LLaDA-2.1-mini (16B) by more than +29 and +14 points respectively — despite using half the parameters. At high concurrency, I-DLM delivers 2.9–4.1x higher throughput than prior DLMs. With gated LoRA, ISD enables lossless (bit-for-bit identical) acceleration that outperforms speculative decoding baselines at scale.
We identify three fundamental bottlenecks in current DLMs and address them through three corresponding components:

- **Introspective training.** Convert pretrained AR models into introspective diffusion models via causal attention, logit shift, and an all-masked objective. Every position — masked or clean — is trained under the same directional regime.
- **Introspective strided decoding (ISD).** Generate N tokens per forward pass while verifying prior tokens against causal anchor distributions via the p/q acceptance criterion.
- **Drop-in serving.** Strict causal attention enables direct integration into existing AR serving stacks (SGLang). No custom infrastructure needed — paged KV cache, continuous batching, and CUDA graphs work unmodified.
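The p/q acceptance criterion is, as far as the description above indicates, the standard speculative-sampling accept/resample rule applied to the model's own drafts. A minimal toy sketch, where every distribution is a made-up stand-in (in the real system both the parallel proposal and the causal anchor would come from the same I-DLM in one forward pass):

```python
import math
import random

VOCAB = 8
rng = random.Random(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def draft_dist(ctx):
    # Stand-in for the parallel (diffusion-style) proposal distribution.
    return softmax([math.sin(i + len(ctx)) for i in range(VOCAB)])

def anchor_dist(ctx):
    # Stand-in for the causal "anchor" distribution used for verification.
    return softmax([math.cos(i + len(ctx)) for i in range(VOCAB)])

def sample(dist):
    return rng.choices(range(VOCAB), weights=dist, k=1)[0]

def isd_step(ctx, n=4):
    """Draft n tokens, then verify each against the causal anchor with the
    p/q rule: accept with probability min(1, p/q); on the first rejection,
    resample from the residual max(0, p - q) and stop, as in speculative
    sampling."""
    drafts, qs = [], []
    for _ in range(n):
        q = draft_dist(ctx + drafts)
        tok = sample(q)
        drafts.append(tok)
        qs.append(q)
    accepted = []
    for tok, q in zip(drafts, qs):
        p = anchor_dist(ctx + accepted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            resid = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
            z = sum(resid)
            accepted.append(sample([r / z for r in resid]))
            break
    return accepted

print(isd_step([1, 2, 3], n=4))
```

The accept/resample structure is what makes "lossless" acceleration possible: when verification uses the exact causal distribution, accepted outputs are distributed identically to ordinary AR decoding.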
Across 15 benchmarks, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while surpassing all prior DLMs in both quality and practical serving efficiency. I-DLM-8B outperforms LLaDA-2.1-mini (16B) by +29.2 points on AIME-24 and +14.7 on LiveCodeBench-v6 despite using half the parameters.
Accuracy (%) on 15 benchmarks. I-DLM uses ISD (N=4, sampling). Blue = best non-AR under 30B. Bold = best non-AR under 100B.
| Benchmark | LLaDA-2.1-mini (16B) | LLaDA-2.0-flash (100B) | LLaDA-2.1-flash (100B) | SDAR (8B) | SDAR (30B) | I-DLM (8B) | Qwen3 (8B) | I-DLM (32B) | Qwen3 (32B) |
|---|---|---|---|---|---|---|---|---|---|
| **Knowledge & Reasoning** | | | | | | | | | |
| ARC-C | 90.2 | --- | --- | 91.9 | 93.2 | 95.5 | 95.8 | 96.8 | 97.2 |
| MMLU | 74.5 | --- | --- | 78.6 | 82.8 | 82.4 | 83.5 | 86.8 | 87.2 |
| MMLU-Pro | 64.8 | 74.8 | 76.6 | 56.9 | 61.5 | 73.1 | 75.1 | 79.7 | 80.1 |
| GPQA-D | 46.0 | --- | --- | 40.2 | 36.7 | 59.1 | 59.0 | 62.1 | 64.1 |
| GPQA | 53.3 | 62.3 | 67.3 | --- | --- | 54.5 | 56.0 | 58.7 | 65.0 |
| **Math** | | | | | | | | | |
| GSM8K | 89.0 | --- | --- | 91.7 | 91.4 | 96.0 | 96.0 | 94.9 | 94.7 |
| MATH-500 | 85.0 | --- | --- | 78.6 | 77.8 | 95.8 | 95.8 | 97.6 | 97.8 |
| MathBench | 84.2 | --- | --- | 76.9 | 79.3 | 89.1 | 93.1 | 95.6 | 95.5 |
| AIME-24 | 43.3 | --- | --- | 10.0 | 16.7 | 72.5 | 76.7 | 83.3 | 76.7 |
| AIME-25 | 43.3 | 60.0 | 63.3 | 10.0 | 10.8 | 61.0 | 60.0 | 80.0 | 80.0 |
| **Code** | | | | | | | | | |
| HumanEval | 86.0 | --- | --- | 78.7 | 87.2 | 92.7 | 95.7 | 96.3 | 96.3 |
| MBPP | 82.1 | --- | --- | 72.0 | 71.6 | 92.8 | 93.4 | 94.6 | 95.7 |
| LCB-v6 | 30.4 | 42.5 | 45.4 | 16.6 | 21.7 | 45.1 | 50.3 | 57.1 | 58.3 |
| **Instruction Following** | | | | | | | | | |
| IFEval | 83.2 | 82.6 | 83.6 | 61.4 | 60.6 | 84.7 | 84.7 | 84.7 | 84.5 |
Benchmarks commonly reported across diffusion LLM methods. "---" = not reported.
| Method | GSM8K | MMLU | HumanEval | MBPP | IFEval |
|---|---|---|---|---|---|
| Qwen3-8B (AR) | 96.0 | 83.5 | 95.7 | 93.4 | 84.7 |
| NBDiff (7B) | 91.0 | 82.9 | 89.0 | 87.6 | 60.8 |
| Jacobi Forcing (7B) | 91.4 | --- | 83.5 | 70.4 | --- |
| WeDLM (8B) | 90.2 | 75.5 | 75.0 | 67.0 | --- |
| LightningRL (8B) | 90.3 | --- | 72.6 | 58.3 | --- |
| TiDAR (8B) | 80.4 | 76.6 | 57.9 | 65.4 | --- |
| DREAM (7B) | 81.0 | 70.6 | 57.9 | 58.8 | 62.5 |
| Fast-dLLM (7B) | 78.5 | --- | 43.3 | 28.2 | --- |
| Mercury Coder Small | --- | --- | 90.0 | 76.6 | --- |
| Gemini Diffusion | --- | --- | 89.6 | 76.0 | --- |
| I-DLM (8B, ours) | 96.0 | 82.4 | 92.7 | 92.8 | 84.7 |
```bibtex
@article{yu2025introspective,
  title={Introspective Diffusion Language Models},
  author={Yu, Yifan and Jian, Yuqing and Wang, Junxiong and Zhou, Zhongzhu
          and Zhuang, Donglin and Fang, Xinyu and Yanamandra, Sri
          and Wu, Xiaoxia and Wu, Qingyang and Song, Shuaiwen Leon
          and Dao, Tri and Athiwaratkun, Ben and Zou, James
          and Lai, Fan and Xu, Chenfeng},
  journal={arXiv preprint},
  year={2025}
}
```