Factuality Matters: When Image Generation and Editing Meet Structured Visuals

1CUHK MMLab    2Beihang University    3Krea AI    4Shanghai Jiao Tong University
5Shanghai AI Lab    6Hugging Face    7National University of Singapore
8ByteDance    9The University of Hong Kong
*Equal Contribution    Corresponding Authors

Overview of our work. Left: We showcase diverse text-to-image (T2I) and editing examples from our dataset. In contrast to natural images, modeling structured visuals demands sophisticated composition planning, strong multimodal understanding, and precise text rendering, as highlighted by the three key characteristics. Right: Our model demonstrates competitive performance against leading closed-source systems on both structured image generation and editing benchmarks.

Abstract

While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Building on it, we train a unified model that integrates a VLM with FLUX.1 Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 1,700 challenging instances, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even leading closed-source systems remain far from satisfactory. Our model attains strong editing performance, and inference-time reasoning yields consistent gains across diverse architectures. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.


StructBench Leaderboard


StructEditBench results by category, reporting Accuracy (%) ↑ and PSNR ↑. Each cell lists Acc / PSNR.

Model             Math            Chart           Graph           Puzzle          Science         Table           Overall
Nano Banana       50.46 / 20.77   46.42 / 14.93   52.97 / 21.22   66.56 / 22.92   69.16 / 22.61   75.71 / 19.75   51.57 / 21.09
GPT-Image         51.49 / 17.06   45.82 / 12.56   50.71 / 17.24   76.03 / 16.55   67.61 / 16.61   83.26 / 14.35   52.20 / 16.64
Seedream 4.0      51.06 / 23.63   46.83 / 16.54   51.72 / 24.12   71.13 / 26.93   69.22 / 26.46   88.19 / 24.75   52.85 / 24.45
UniWorld-V1        9.41 /  8.84    5.99 /  7.87    8.83 /  6.16    9.11 /  7.71   19.91 /  7.67   16.13 /  8.24    8.40 /  8.21
DiMOO             26.79 / 21.56   16.52 / 14.98   24.03 / 21.77   29.52 / 22.26   26.08 / 22.57   24.64 / 19.47   21.00 / 21.49
OmniGen2          29.44 / 15.95   18.55 / 12.44   34.63 / 11.31   28.61 / 16.51   39.55 / 15.60   30.36 / 16.60   24.30 / 15.49
Ovis-U1           31.64 / 18.45   21.94 / 13.30   38.03 / 19.01   42.08 / 17.92   44.52 / 18.68   35.58 / 16.62   28.06 / 18.25
Hidream-E1.1      28.07 / 18.43   26.36 / 12.91   29.63 / 18.26   43.77 / 18.04   36.66 / 16.47   48.79 / 17.12   29.63 / 18.01
Bagel             21.27 / 21.38   27.11 / 16.38   29.94 / 22.70   41.59 / 24.22   47.16 / 23.56   47.35 / 21.54   28.87 / 22.06
Bagel-Think       37.40 / 23.97   28.98 / 16.82   42.51 / 26.49   36.11 / 26.75   43.15 / 25.57   40.46 / 23.83   33.34 / 24.70
Step1X-Edit       34.47 / 23.41   28.05 / 16.68   33.26 / 24.56   60.48 / 25.94   46.47 / 24.98   57.81 / 23.97   34.11 / 24.03
FLUX.1 Kontext    37.36 / 19.78   32.29 / 14.61   39.12 / 20.10   58.35 / 20.38   50.39 / 20.99   58.05 / 18.52   37.56 / 19.84
Qwen-Edit         40.48 / 23.73   30.17 / 12.33   44.83 / 26.11   53.74 / 27.31   55.99 / 25.53   67.76 / 25.71   38.12 / 24.81
Ours              54.74 / 23.31   50.58 / 15.33   60.18 / 24.65   73.00 / 26.33   75.05 / 25.80   77.08 / 23.19   55.98 / 24.01

Quantitative comparison on StructT2IBench, reporting Accuracy (%) ↑ per category.

Model             Chart    Graph    Math     Puzzle   Science  Table    Overall
Seedream 4.0      35.79    54.08    63.33    50.89    62.59    68.94    47.52
Nano Banana       35.55    58.96    64.81    63.87    60.75    67.20    48.45
GPT-Image         37.09    57.00    63.25    59.42    60.94    83.31    49.58
UniWorld-V1        1.71     5.52     4.72     1.58     8.82     5.25     3.20
Bagel              4.66     3.61     4.02     4.46     8.60     5.74     4.69
Bagel-Think        4.81    15.33    13.89    15.22    19.05     8.97     9.03
Hidream-I1-Full    9.47    20.84    19.20    18.00    26.77    27.05    14.77
OmniGen2          10.67    22.51    22.89    18.63    28.00    22.61    16.24
FLUX.1 Dev        12.35    20.09    19.86    20.63    25.25    27.00    16.51
FLUX.1 Kontext    17.22    24.64    21.42    24.06    30.97    29.16    20.36
Ovis-U1           24.75    16.08    19.45    21.23    26.03    12.70    22.83
Qwen-Image        32.23    48.05    46.98    48.90    53.51    73.65    41.03
Ours              20.91    33.45    41.70    30.66    41.46    32.26    28.80

StructEditBench results on the Chart category only, broken down by edit type and reporting Accuracy (%) ↑ and PSNR ↑. Each cell lists Acc / PSNR.

Model             Category        Color           Num             Auxiliary       Add&Del         Overall
GPT-Image         40.62 /  9.85   54.57 / 13.91   33.12 / 13.48   64.03 / 13.48   38.02 / 12.73   45.82 / 12.56
Nano Banana       39.75 / 10.33   54.34 / 17.59   35.64 / 16.56   67.40 / 17.39   36.77 / 13.88   46.42 / 14.93
Seedream 4.0      38.13 /  9.67   61.84 / 21.77   36.00 / 19.22   65.92 / 19.46   36.04 / 14.28   46.83 / 16.54
UniWorld-V1        5.98 /  7.60    8.19 /  8.29    2.58 /  7.57   10.24 /  8.62    2.81 /  7.35    5.99 /  7.87
DiMOO             11.20 /  9.82   15.31 / 16.97   17.39 / 17.30   21.57 / 17.59   18.57 / 14.46   16.52 / 14.98
OmniGen2          17.30 /  8.61   28.84 / 11.78   15.48 / 14.03   22.42 / 16.59   10.54 / 12.11   18.55 / 12.44
Ovis-U1           18.15 /  9.57   30.68 / 15.38   20.79 / 15.13   25.49 / 14.25   17.21 / 13.11   21.94 / 13.30
Hidream-E1.1      22.69 /  9.29   39.86 / 14.07   21.49 / 15.04   32.65 / 14.34   18.05 / 12.71   26.36 / 12.91
Bagel             25.69 /  9.08   38.20 / 20.46   26.30 / 20.29   30.00 / 21.24   17.79 / 14.93   27.11 / 16.82
Step1X-Edit       21.96 / 10.40   36.51 / 19.75   25.46 / 20.22   34.40 / 19.46   24.92 / 15.11   28.05 / 16.68
Bagel-Think       24.55 /  9.00   45.46 / 19.35   25.60 / 20.14   34.54 / 20.62   18.69 / 14.58   28.98 / 16.38
Qwen-Edit         23.53 /  9.52   41.80 / 13.63   22.90 / 13.69   42.39 / 13.07   23.27 / 12.46   30.17 / 12.33
FLUX.1 Kontext    24.67 / 10.14   44.56 / 16.53   29.24 / 16.74   44.02 / 16.79   23.06 / 13.95   32.29 / 14.61
Ours              50.81 / 10.40   64.10 / 18.16   33.45 / 17.04   66.34 / 17.90   38.10 / 14.37   50.58 / 15.33

Structured Image Dataset


We release a large-scale Structured Image Dataset tailored for generation and editing of charts, diagrams, math figures, and more. Starting from ~2M executable programs (Python, LaTeX) across diverse categories, we render valid source images, use GPT-5 to extract salient visual features, and jointly produce aligned image-editing and code-editing instructions. GPT-5 then applies code edits to synthesize target programs and images, yielding strictly aligned source-target pairs. A comprehensive post-processing pipeline removes invalid, low-difference, and low-information samples, resulting in 1.3M high-quality examples. Each example includes source/target images, a dense caption, an image-editing instruction, and a three-step reasoning trajectory to support precise, fact-grounded structured visual generation and editing.

Data construction pipeline. We prompt GPT-5 to extract salient features, then generate paired editing instructions from the source code and rendered image. The source code is modified according to the code-editing instructions. The target image rendered from modified code is passed through rule-based filters to ensure the overall quality of the constructed dataset.
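To make the filtering step concrete, here is a minimal Python sketch of rule-based checks in the spirit of the post-processing described above. The helper names (pixel_diff_ratio, keep_pair) and all thresholds are illustrative assumptions, not the released pipeline.

import numpy as np
from PIL import Image

def pixel_diff_ratio(src: Image.Image, tgt: Image.Image) -> float:
    """Fraction of pixels that changed between the source and target renders."""
    a = np.asarray(src.convert("RGB"), dtype=np.int16)
    b = np.asarray(tgt.convert("RGB").resize(src.size), dtype=np.int16)
    return float((np.abs(a - b).sum(axis=-1) > 10).mean())

def keep_pair(src_img, tgt_img, instruction: str,
              min_diff=0.002, max_diff=0.6, min_instr_words=5) -> bool:
    """Drop invalid, low-difference, or low-information samples (illustrative rules)."""
    if src_img is None or tgt_img is None:          # program failed to render
        return False
    diff = pixel_diff_ratio(src_img, tgt_img)
    if diff < min_diff:                             # edit changed almost nothing
        return False
    if diff > max_diff:                             # edit replaced the whole image
        return False
    if len(instruction.split()) < min_instr_words:  # uninformative instruction
        return False
    return True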

Benchmark Construction


StructBench is a curated benchmark for structured visual generation and editing, spanning Math, Graph, Chart, Puzzle, and Table domains. We select diverse, high-quality items from our code-rendered dataset via clustering, stratified sampling, and dual GPT-5/human review. For each item, GPT-5 produces detailed descriptions that are decomposed into atomic Q&A pairs covering fine-grained attributes and relations. Evaluation uses StructScore, a controlled multi-turn VLM protocol with open-ended answers compared against concise ground truths. Through human audits and iterative Q&A refinement, metric reliability on ground-truth images improves from ~80% to >95%. The final benchmark includes 1,714 items with 32,031 and 37,941 Q&A pairs for editing and generation respectively.

Benchmark construction and evaluation workflow. (a) Benchmark construction: We cluster the data into six categories, and for each editing and text-to-image (T2I) example, GPT-5 generates detailed image descriptions that are transformed into Q&A pairs for evaluating diverse visual aspects. (b) Evaluation protocol: Using the Q&A pairs, GPT-5 is queried on generated images for open-ended responses, which are compared with ground-truth answers to yield a final score.
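For concreteness, the snippet below sketches a StructScore-style scoring loop: one open-ended query per atomic question, compared against the concise ground truth. The ask_vlm and answers_match callables are hypothetical stand-ins for the actual GPT-5 querying and answer-matching steps.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class QAPair:
    question: str
    ground_truth: str  # concise reference answer

def struct_score(image_path: str,
                 qa_pairs: List[QAPair],
                 ask_vlm: Callable[[str, str], str],
                 answers_match: Callable[[str, str], bool]) -> float:
    """Query the VLM once per atomic question and return the fraction answered correctly."""
    correct = 0
    for qa in qa_pairs:
        answer = ask_vlm(image_path, qa.question)   # open-ended response from the judge VLM
        if answers_match(answer, qa.ground_truth):  # compared against the ground-truth answer
            correct += 1
    return correct / max(len(qa_pairs), 1)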

Model Training


We build on FLUX.1 Kontext, adapting its unified diffusion transformer for structured image generation and editing. Text is encoded with T5 (we discard CLIP), while both input and target images are encoded by a VAE; these tokens are concatenated and processed with joint attention. To strengthen high-level semantic understanding crucial for structured visuals, we add a lightweight MLP connector that aligns Qwen-VL multimodal features with the backbone, offering stable optimization and low overhead. Training proceeds in three stages: (1) Unified Alignment—freeze the backbone and train only the connector using simple data, suppressing T5 to avoid shortcutting; (2) Hybrid Visual Learning—jointly fine-tune on structured and general datasets with a mask-based loss that downweights backgrounds and unchanged regions; and (3) Thinking Enhancement—inject chain-of-thought reasoning via Qwen-VL and enable inference-time reasoning, where a VLM analyzes the input and guides the generator for complex, semantically grounded edits.
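As a rough illustration of the lightweight connector, the PyTorch sketch below projects VLM features into the backbone token width. The hidden sizes and the two-layer design are assumptions for illustration only, not the released architecture.

import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Projects Qwen-VL multimodal features into the diffusion backbone's token width."""
    def __init__(self, vlm_dim: int = 3584, backbone_dim: int = 3072):  # assumed dims
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vlm_dim),
            nn.Linear(vlm_dim, backbone_dim),
            nn.GELU(),
            nn.Linear(backbone_dim, backbone_dim),
        )

    def forward(self, vlm_features: torch.Tensor) -> torch.Tensor:
        # vlm_features: (batch, num_tokens, vlm_dim) from the frozen VLM
        return self.proj(vlm_features)

# In Stage 1 (Unified Alignment), only the connector would be trainable, e.g.:
# for p in backbone.parameters(): p.requires_grad_(False)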

The three-stage progressive training pipeline. Training difficulty increases across stages, from alignment to hybrid visual learning and thinking enhancement.
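The mask-based loss used in Stage 2 can be pictured with the sketch below, which downweights unchanged or background regions so that edited regions dominate the objective. The weighting scheme and the 0.1 background weight are assumptions, not the exact training objective.

import torch
import torch.nn.functional as F

def masked_mse_loss(pred: torch.Tensor,
                    target: torch.Tensor,
                    change_mask: torch.Tensor,
                    background_weight: float = 0.1) -> torch.Tensor:
    """MSE between predicted and target outputs, weighted so changed regions count more.

    change_mask is 1 where the target differs from the source (broadcastable to pred's shape);
    unchanged/background positions receive only `background_weight`.
    """
    per_elem = F.mse_loss(pred, target, reduction="none")
    weights = background_weight + (1.0 - background_weight) * change_mask
    return (per_elem * weights).sum() / weights.sum().clamp_min(1e-8)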

Citation

@article{zhuo2025structbench,
  title={Factuality Matters: When Image Generation and Editing Meet Structured Visuals},
  author={Zhuo, Le and Han, Songhao and Pu, Yuandong and Qiu, Boxiang and Paul, Sayak and Liao, Yue and Liu, Yihao and Shao, Jie and Chen, Xi and Liu, Si and Li, Hongsheng},
  journal={arXiv preprint arXiv:2510.05091},
  year={2025}
}


Acknowledgements

We thank Ximing Xing for providing the source code of the webpage template used to build this project page.