Artificial Intelligence (AI)-generated images have become increasingly realistic and easy to tailor to concrete real-world claims, creating new challenges for verifying visual evidence. An emerging risk is AI-generated refund fraud, in which manipulated or synthetic images are used to support claims about damaged products, poor delivery conditions, or service-related defects. Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent-evidence detection underexplored.
To bridge this gap, we introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios. We curate real evidence images together with their associated review and product metadata, identify genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation, and synthesize fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models.
Using FraudBench, we evaluate 11 MLLMs, 4 specialized AI-generated image detectors, and human participants under the same settings. Experiments show that current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets. Specialized detectors generally perform better but remain inconsistent across generators and can produce false positives on real-damaged samples, revealing a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.
590 review samples across 27 e-commerce product categories
Travel-service reviews covering hotel, attraction, and transportation complaints
Food delivery and dine-in & pickup service evidence
Self-collected real-world samples under everyday acquisition conditions
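To make the resulting data layout concrete, the following is a hypothetical sketch of a single FraudBench record, based on the construction described above; the field names and string values are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FraudBenchSample:
    sample_id: str
    scenario: str                    # e.g. "e-commerce", "food-delivery", or "travel-service"
    image_paths: List[str]           # one or more evidence images from the same review
    review_text: str                 # the user review / claim accompanying the evidence
    product_metadata: dict           # e.g. product category and title from the platform
    label: str                       # "real-damaged", "real-undamaged", or "fake-damaged"
    generator: Optional[str] = None  # editing/generation model for fake-damaged samples
```

Under this reading, fake-damaged samples would carry the generator that produced them (one of the six editing and generation models), while real-damaged and real-undamaged samples would not.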
Figure 2. Representative examples covering the five evaluation dimensions: input modality, contextual information, multi-step reasoning, prompt sensitivity, and real image preservation.
FraudBench evaluates detection capability along five complementary dimensions reflecting real-world refund-evidence verification conditions.
Single-Image vs. Multi-Image evaluation. The multi-image setting allows the model to exploit cross-image cues such as consistent lighting, viewpoint, and damage appearance within the same review.
Review-Free vs. Review-Conditioned evaluation. Tests whether the associated user review text improves detection quality or introduces misleading shortcuts.
Multi-Step Reasoning evaluation. Compares direct joint inference against a structured per-image decomposition step before a final aggregated verdict, testing whether explicit evidence integration improves reliability (a minimal sketch of the two modes follows this list).
Prompt Sensitivity evaluation. Evaluates each detector with five semantically equivalent prompt styles to measure whether decisions are driven by stable visual evidence or by surface prompt phrasing.
Real Image Preservation evaluation. Assesses the True Negative Rate (TNR) on genuine damaged evidence. This dimension is central to trustworthy refund adjudication: falsely accusing an honest customer damages platform trust.
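As an illustration of the multi-step reasoning dimension, here is a minimal sketch of the two prompting modes being compared. The `ask_mllm` wrapper, the prompt wording, and the function names are illustrative assumptions, not FraudBench's actual implementation.

```python
from typing import List

def ask_mllm(prompt: str, images: List[str]) -> str:
    """Hypothetical wrapper around whichever MLLM is under evaluation."""
    raise NotImplementedError  # plug in the model's API client here

def direct_verdict(images: List[str], review: str) -> str:
    # Mode 1: direct joint inference over all evidence images at once.
    prompt = (
        f"Review claim: {review}\n"
        "Decide whether the attached evidence images show genuine damage or "
        "AI-generated/manipulated damage. Answer 'real' or 'fake'."
    )
    return ask_mllm(prompt, images)

def decomposed_verdict(images: List[str], review: str) -> str:
    # Mode 2: structured per-image decomposition before an aggregated verdict.
    notes = [
        ask_mllm(
            "Describe the visible damage in this image and any signs of "
            "synthesis or editing (lighting, texture, geometry).",
            [img],
        )
        for img in images
    ]
    summary = "\n".join(f"Image {i + 1}: {note}" for i, note in enumerate(notes))
    prompt = (
        f"Review claim: {review}\n"
        f"Per-image observations:\n{summary}\n"
        "Based on these observations, answer 'real' or 'fake'."
    )
    return ask_mllm(prompt, images)
```

The single-image vs. multi-image dimension then corresponds to calling either function with one evidence image or with all images from the same review.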
The Central Dilemma of Current Detection Paradigms
Wrongly Accusing an Honest Customer → Severe Trust Crisis
Missing AI-Fabricated Evidence → Direct Financial Losses
MLLMs preserve genuine evidence but systematically miss AI-manipulated damage. Specialized detectors catch synthetic images but frequently flag real damaged goods as fake.
Across MLLMs in the single-image no-review setting, the average real-image TNR is 0.947 — but the average fake-damage TPR is only 0.197. The dominant failure mode is not excessive rejection of genuine evidence, but systematic under-detection of synthetic refund claims.
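For context, the balanced accuracy reported in the tables below is consistent with the standard definition, the unweighted mean of TPR and TNR. Here is a minimal sketch of the headline metrics under that assumption (the paper's exact per-sample weighting may differ):

```python
def tpr(true_positives: int, false_negatives: int) -> float:
    """Fake-damage detection rate: fraction of fake-damaged evidence flagged as fake."""
    return true_positives / (true_positives + false_negatives)

def tnr(true_negatives: int, false_positives: int) -> float:
    """Real-image preservation rate: fraction of genuine damaged evidence kept as real."""
    return true_negatives / (true_negatives + false_positives)

def balanced_accuracy(tpr_value: float, tnr_value: float) -> float:
    """Standard balanced accuracy for a binary real-vs-fake decision."""
    return (tpr_value + tnr_value) / 2

# Worked example with the averages quoted above:
print(round(balanced_accuracy(0.197, 0.947), 3))  # 0.572, barely above the 0.500 chance level
```

Plugging in the quoted averages gives (0.197 + 0.947) / 2 ≈ 0.572, so near-perfect real-image preservation alone does not yield a usable detector.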
Detection difficulty varies substantially across generators. Average MLLM TPR is only 0.080 on GPT Image 2 but rises to 0.351 on Qwen-Image-Edit. Frontier text-to-image models produce more visually integrated fake-damaged evidence than image-editing pipelines.
The best specialized detector (ForgeLens [GenImage]) achieves 0.904 balanced accuracy but is highly checkpoint-dependent. Effort [Chameleon] achieves high TPR (avg 0.854) but a TNR of only 0.051 on real images, making it unsuitable for real-world refund adjudication.
Multi-image context substantially boosts Gemini 3 Flash, from 0.720 balanced accuracy (single-image + review) to 0.823 (multi-image + review). Yet review text alone adds minimal gain (+0.005), revealing that models struggle to ground textual claims in visual evidence.
Salient defects (cracked, leaking, shattered) are easier to detect (avg TPR ~0.25–0.30). Subtle deformations (packaging-damaged, bent/warped) are much harder (avg TPR ~0.12–0.14). Packaging-damage detection is near zero for several MLLMs.
Human participants outperform most MLLMs but still exhibit non-negligible error rates. Even on the easiest generator subset (Qwen-Image-Edit), human TPR is only 0.709, alongside a TNR of 0.793, confirming that manual inspection also struggles with these highly realistic AI manipulations.
Single-Image, No-Review Setting · TPR ↑ = Fake-Damage Detection Rate · TNR ↑ = Real-Image Preservation Rate
Bold = Best in Group | Underline = Second Best | Red = Problematic Value
Multimodal Large Language Models (MLLMs)
Generator columns report fake-damage TPR ↑ per generator; Bal.Acc ↑ and F1 ↑ are overall scores; the last column is the real-damaged TNR ↑.

| Model | GPT Image 2 | Grok Imagine | Nano Banana 2 | Wan2.7-Image | Qwen-Image-2.0 | Qwen-Image-Edit | Bal.Acc ↑ | F1 ↑ | Real-Damaged TNR ↑ |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 mini | 0.040 | 0.045 | 0.083 | 0.129 | 0.147 | 0.285 | 0.558 | 0.212 | 0.994 |
| Gemini 3 Flash | 0.158 | 0.199 | 0.258 | 0.398 | 0.491 | 0.692 | 0.674 | 0.526 | 0.982 |
| Grok 4.1 Fast Reasoning | 0.105 | 0.135 | 0.144 | 0.188 | 0.218 | 0.262 | 0.538 | 0.279 | 0.901 |
| Grok 4.20 Reasoning | 0.062 | 0.065 | 0.141 | 0.199 | 0.161 | 0.242 | 0.562 | 0.244 | 0.979 |
| Kimi K2.6 | 0.026 | 0.052 | 0.079 | 0.218 | 0.260 | 0.356 | 0.581 | 0.277 | 0.996 |
| Qwen3.6-Plus | 0.077 | 0.140 | 0.179 | 0.266 | 0.325 | 0.441 | 0.608 | 0.371 | 0.977 |
| Qwen3.6-35B-A3B | 0.144 | 0.242 | 0.262 | 0.365 | 0.407 | 0.531 | 0.603 | 0.476 | 0.882 |
| Qwen3.5-Omni-Plus | 0.008 | 0.035 | 0.049 | 0.103 | 0.137 | 0.202 | 0.544 | 0.159 | 1.000 |
| Qwen3-VL-Flash | 0.043 | 0.090 | 0.172 | 0.247 | 0.278 | 0.364 | 0.592 | 0.321 | 0.986 |
| Qwen3-VL-Plus | 0.002 | 0.006 | 0.024 | 0.050 | 0.049 | 0.108 | 0.518 | 0.075 | 0.997 |
| QVQ-Max-Latest | 0.220 | 0.239 | 0.294 | 0.321 | 0.335 | 0.383 | 0.510 | 0.438 | 0.721 |
| Random Guessing | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | — | 0.500 |
| Human Reference | 0.420 | 0.509 | 0.539 | 0.598 | 0.697 | 0.709 | 0.686 | 0.704 | 0.793 |
Specialized AI-Generated Image Detectors (brackets indicate the training checkpoint)
Generator columns report fake-damage TPR ↑ per generator; Bal.Acc ↑ and F1 ↑ are overall scores; the last column is the real-damaged TNR ↑.

| Model | GPT Image 2 | Grok Imagine | Nano Banana 2 | Wan2.7-Image | Qwen-Image-2.0 | Qwen-Image-Edit | Bal.Acc ↑ | F1 ↑ | Real-Damaged TNR ↑ |
|---|---|---|---|---|---|---|---|---|---|
| CO-SPY [ProGAN] | 0.977 | 0.099 | 0.312 | 0.218 | 0.894 | 0.996 | 0.782 | 0.733 | 0.982 |
| CO-SPY [SD-v1.4] | 0.360 | 0.310 | 0.282 | 0.360 | 0.339 | 0.349 | 0.509 | 0.455 | 0.685 |
| ForgeLens [ProGAN] | 0.700 | 0.217 | 0.516 | 0.823 | 0.859 | 0.925 | 0.815 | 0.798 | 0.958 |
| ForgeLens [GenImage] | 0.967 | 0.158 | 1.000 | 0.990 | 1.000 | 0.989 | 0.904 | 0.915 | 0.957 |
| Effort [SD-v1.4] | 0.710 | 0.546 | 0.658 | 0.755 | 0.858 | 0.890 | 0.795 | 0.828 | 0.853 |
| Effort [Chameleon] | 0.828 | 0.824 | 0.868 | 0.882 | 0.849 | 0.872 | 0.452 | 0.838 | 0.051 |
| IAPL [ProGAN] | 0.255 | 0.295 | 0.140 | 0.122 | 0.398 | 0.568 | 0.628 | 0.438 | 0.960 |
| IAPL [SD-v1.4] | 0.973 | 0.895 | 0.916 | 0.996 | 0.955 | 0.994 | 0.727 | 0.932 | 0.499 |
| Random Guessing | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | — | 0.500 |
| Human Reference | 0.420 | 0.509 | 0.539 | 0.598 | 0.697 | 0.709 | 0.686 | 0.704 | 0.793 |
Confidence scores (Conf.) omitted for readability. Full results including Conf. are in the paper.