Co-Chairs
Huang-Chia Shih, National Central University, TW
Chih-Chung Hsu, National Yang Ming Chiao Tung University, TW
Aim and Scope
With the rapid advancement of multi-modal Large Language Models (LLMs), the boundaries between vision, language, audio, motion signals, and structured data have become increasingly blurred. Multi-modal LLMs are now capable of interpreting complex scenes, aligning cross-domain information, performing high-level reasoning, and enhancing traditional computer vision pipelines with richer semantic understanding and contextual intelligence. This special session aims to gather cutting-edge research on the integration of multi-modal LLMs with computer vision techniques. The goal is to explore how multi-modal alignment, cross-modal reasoning, and LLM-enabled perception frameworks can advance visual tasks in accuracy, robustness, and interpretability.
We welcome contributions that develop new architectures, algorithms, foundational models, efficient training strategies, or real-world applications enabled by multi-modal LLM–vision integration. The scope spans theoretical foundations, methodological innovation, and practical implementations across domains such as robotics, healthcare imaging, sports biomechanics, autonomous systems, smart cities, and video understanding.
Topics of Interest
Topics of interest include, but are not limited to:
Multi-modal LLM–Vision Integration
- Multi-modal LLMs for image, video, 3D, and sensor fusion tasks
- LLM-guided vision models integrating text, audio, skeleton, IMU, or other modalities
- Cross-modal embedding alignment and representation learning
- Multi-modal prompt engineering and instruction tuning for vision applications
Reasoning and Understanding Across Modalities
- Scene understanding through multi-modal reasoning
- Vision–language–audio models for contextual interpretation
- LLM-based temporal and spatial reasoning in video or 3D data
- Multi-modal question answering, dialog systems, and interactive perception
Enhanced Perception and Intelligent Systems
- Multi-modal LLMs for robotics, autonomous vehicles, drones, and industrial automation
- LLM-assisted action recognition, behavior prediction, and trajectory reasoning
- Multi-modal fusion for anomaly detection and complex event understanding
- Agent-based frameworks leveraging LLM reasoning for decision making
Applications in Science, Health, and Society
- Medical imaging enhanced by multi-modal LLM-driven interpretation
- Sports analytics (e.g., baseball, golf, biomechanics) combining vision and sensor signals
- Remote sensing, geospatial intelligence, and environmental monitoring
- Smart city applications using multi-modal surveillance or IoT data
Performance, Robustness, and Interpretability
- Benchmarks for multi-modal LLM–vision systems
- Robustness under occlusion, noise, resolution variation, or modality dropout
- Explainability, trustworthiness, and safety analysis in LLM-augmented CV systems
- Model compression, acceleration, and deployment for real-time applications
Datasets, Tools, and Emerging Paradigms
- New multi-modal datasets for training and evaluating LLM–vision models
- Toolkits, frameworks, or agents enabling multi-modal CV workflows
- Multi-modal diffusion models, neural fields, and multi-sensory representations
- Retrieval-augmented generation (RAG) for visual and cross-modal tasks