Research

CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks

CoLA extends parameter-efficient LoRA to multimodal models with inter-modal adaptation pathways, achieving ~3% improvements on visual grounding and audio-visual benchmarks while maintaining efficiency.

Tuesday, April 7, 2026, 12:00 PM UTC · 2 min read · Source: arXiv cs.CL (Computation & Language)

CoLA extends parameter-efficient fine-tuning with LoRA to multimodal foundation models by introducing dedicated inter-modal adaptation pathways alongside the standard intra-modal ones. Evaluated on vision-language visual grounding benchmarks (the RefCOCO variants) and on audio-visual benchmarks (AVE, AVS), it achieves roughly 3% and 2% relative improvements, respectively, while maintaining parameter efficiency. The authors present it as the first multi-task PEFT framework for visual grounding.
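The blurb does not describe CoLA's internals, so the following is only a minimal PyTorch sketch of the general idea it names: a frozen linear layer carrying a standard intra-modal LoRA update plus a second low-rank pathway driven by the other modality's features. The class name CrossModalLoRALinear, the ranks, the additive combination, and the tensor shapes are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn


class CrossModalLoRALinear(nn.Module):
    """Frozen linear layer with intra-modal and inter-modal low-rank updates.

    Hypothetical sketch: both low-rank branches are added to the frozen
    output, and the B matrices start at zero so training begins from the
    pretrained function, as in standard LoRA.
    """

    def __init__(self, d_in: int, d_out: int, d_other: int,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.scale = alpha / rank
        # Intra-modal LoRA branch: delta_W = B_intra @ A_intra
        self.A_intra = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_intra = nn.Parameter(torch.zeros(d_out, rank))
        # Inter-modal branch: low-rank map from the other modality's features
        self.A_inter = nn.Parameter(torch.randn(rank, d_other) * 0.01)
        self.B_inter = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        out = self.base(x)                                                  # frozen path
        out = out + self.scale * (x @ self.A_intra.T) @ self.B_intra.T      # intra-modal
        out = out + self.scale * (other @ self.A_inter.T) @ self.B_inter.T  # inter-modal
        return out


# Usage with made-up shapes: vision tokens adapted, language features injected.
vis = torch.randn(2, 196, 768)   # hypothetical vision token stream
txt = torch.randn(2, 196, 512)   # hypothetical aligned language features
layer = CrossModalLoRALinear(d_in=768, d_out=768, d_other=512)
print(layer(vis, txt).shape)     # torch.Size([2, 196, 768])
```

Note that the inter-modal branch here requires the two streams to be aligned token-for-token; a real system would more plausibly place such pathways at cross-attention layers, which this sketch does not attempt.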
Tags
research