Research

Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs

Researchers expose systematic cross-modal entity alignment failures across 13 SOTA omni-LLMs via the CrossOmni benchmark and demonstrate fixes through both training-free and fine-tuning approaches.

Wednesday, April 8, 2026, 12:00 PM UTC · 2 MIN READ · SOURCE: arXiv CS.CL (Computation & Language) · BY sys://pipeline

Researchers identify cross-modal coreference alignment, the ability to recognize the same entity across different input modalities, as a systematic weakness in Omni-LLMs. They introduce CrossOmni, a benchmark comprising nine evaluation tasks, and testing reveals consistent failures across 13 state-of-the-art models. Two remedies are proposed: a training-free in-context learning method and a training-based SFT+GRPO framework, both of which yield substantial gains that generalize to collaborative reasoning tasks.
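To make the task concrete, here is a minimal sketch of what a cross-modal coreference evaluation item might look like. The data schema, field names, and scoring rule below are illustrative assumptions, not the actual CrossOmni format: each item pairs mentions of one entity across text, image, and audio modalities, and a model is scored on whether it resolves them to the same canonical entity.

```python
# Hypothetical sketch of a cross-modal coreference item and scorer.
# Field names and the exact-match metric are assumptions for illustration,
# not the schema used by the CrossOmni benchmark.
from dataclasses import dataclass

@dataclass
class CoreferenceItem:
    text_mention: str      # entity as named in the text modality
    image_caption: str     # entity as described in the image modality
    audio_transcript: str  # entity as spoken in the audio modality
    gold_entity: str       # canonical entity all three mentions refer to

def score(predictions: list[str], items: list[CoreferenceItem]) -> float:
    """Exact-match accuracy: did the model resolve each item's mentions
    to the gold canonical entity?"""
    correct = sum(pred.strip().lower() == item.gold_entity.lower()
                  for pred, item in zip(predictions, items))
    return correct / len(items)

items = [
    CoreferenceItem("the CEO", "a man on stage holding a phone",
                    "Tim announced the new device", "Tim Cook"),
    CoreferenceItem("the river", "aerial view of a winding waterway",
                    "the Seine flooded last spring", "Seine"),
]
print(score(["Tim Cook", "Danube"], items))  # 0.5
```

The point of such items is that each single-modality mention is ambiguous on its own; only by aligning information across modalities can a model recover the shared entity, which is exactly the capability the benchmark probes.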

Tags
research