Realistic multimodal misinformation detection

ReMMD: Realistic Multilingual Multi-Image Agentic Verification

ReMMD pairs ReMMDBench, a multilingual multi-image benchmark with graded veracity and distortion labels, with ReMMD-Agent, a persistent-memory verifier for evidence-centered multimodal misinformation detection.

Chenhao Dang1,2, Dantong Zhu4, Jun Yang5, Conghui He2,*, Weijia Li3,*

1Shanghai Jiao Tong University · 2Shanghai AI Laboratory · 3Tsinghua University · 4Central South University · 5CETC 15th Research Institute · *Corresponding authors

500 samples
2,756 images
5 + 2 languages and cross-lingual settings
5-way L1 veracity labels
8 L2 distortion labels

ReMMDBench

A benchmark built around operational verification pressure

ReMMDBench moves beyond single-image binary detection by combining long multilingual narratives, carousel-style image sets, mixed provenance, five-way veracity, eight distortion labels, and audited rationales.

Multilingual

Five monolingual languages plus cross-lingual transfer

English, Chinese, German, Japanese, and French samples expose regional evidence access, entity anchoring, and cross-script variation.

Multi-image

Posts with evidence, decoration, reuse, and edited visuals

Each verifier must decide which images matter, which are persuasive context, and which introduce visual or cross-modal distortion.

Hierarchical labels

L1 verdicts, L2 distortion diagnosis, and L3 rationales

The task asks systems to calibrate partial truth, diagnose the mechanism of distortion, and justify the decision.

ReMMDBench construction pipeline
Benchmark construction controls topic, language, length, image provenance, label conditions, and evidence validation.
ReMMDBench statistics panels
Language, distortion, image provenance, and text-length distributions in ReMMDBench.

ReMMD-Agent

Persistent evidence memory before judgment

ReMMD-Agent decomposes posts into atomic claims, observations, and image bindings, retrieves multimodal evidence, reuses a memory bank, and predicts structured L1/L2/L3 outputs from an explicit evidence state.

ReMMD-Agent pipeline
Atomic parsing and persistent evidence reuse reduce redundant retrieval while keeping judgments traceable.
Atomic parsing Separates central claims, image observations, and text-image bindings before retrieval.
Memory reuse Stores reusable evidence so related claims and images do not trigger redundant tool calls.
Structured judgment Produces five-way veracity, eight-label distortion diagnosis, and a concise rationale.

Leaderboard

Main ReMMDBench results

On the full 500-sample split, ReMMD-Agent with GPT-5.2 achieves the best five-way veracity performance, while ReMMD-Agent with Qwen3.5-9B gives the strongest L2 macro-F1 among comparable open-backbone runs.

System Backbone L1 Acc. L1 Macro-F1 L2 Macro-F1 L2 Exact
Manus proprietary 30.00 28.35 40.06 2.20
ChatGPT proprietary 30.20 28.24 43.63 3.00
MMD-Agent GPT-5.2 26.40 23.42 41.98 2.00
T2-Agent GPT-5.2 28.20 26.00 42.68 2.60
ReMMD-Agent GPT-5.2 41.80 39.12 45.15 5.00
ReMMD-Agent Qwen3.5-9B 37.20 37.18 46.97 10.00
41.80% best L1 accuracy
39.12% best L1 macro-F1
-17.5% cost vs. MMD-Agent
-79.9% cost vs. T2-Agent
Fine-grained analysis by length, language, and labels
Fine-grained Qwen3.5-9B analysis across text length, language, and distortion labels.

Sample gallery

Fifteen real benchmark samples across language and length

The gallery uses one short, one medium, and one long sample from each language. Cards keep long text compact, while each sample opens into a full text and multi-image viewer.

Citation

Reference

@article{dang2026remmd,
  title = {ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection},
  author = {Dang, Chenhao and Zhu, Dantong and Yang, Jun and He, Conghui and Li, Weijia},
  year = {2026}
}