1

Babysit A Language Model From Scratch: Interactive Language Learning by Trials and Demonstrations

Humans are efficient language learners and inherently social creatures. Our language development is largely shaped by our social interactions, for example, the demonstration and feedback from caregivers. Contrary to human language learning, recent …

Efficient In-Context Learning in Vision-Language Models for Egocentric Videos

Recent advancements in text-only large language models (LLMs) have highlighted the benefit of in-context learning for adapting to new tasks with a few demonstrations. However, extending in-context learning to large vision-language models (VLMs) using …

DriVLMe: Exploring Foundation Models as Autonomous Driving Agents That Perceive, Communicate, and Navigate

Recent advancements in foundation models (FMs) have unlocked new prospects in autonomous driving, yet the experimental settings of these studies are preliminary, over-simplified, and fail to capture the complexity of real-world driving scenarios in …

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Most multimodal large language models (MLLMs) learn language-to-object grounding through causal language modeling where grounded objects are captured by bounding boxes as sequences of location tokens. This paradigm lacks pixel-level representations …

Inversion-Free Image Editing with Natural Language

Despite recent advances in inversion-based editing, text-guided image manipulation remains challenging for diffusion models. The primary bottlenecks include 1) the time-consuming nature of the inversion process; 2) the struggle to balance consistency …

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in …

Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation

Zero-Shot Object Navigation (ZSON) enables agents to navigate towards open-vocabulary objects in unknown environments. The existing works of ZSON mainly focus on following individual instructions to find generic object classes, neglecting the …

User Modeling Challenges in Interactive AI Assistant Systems

.

GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning

Pre-trained vision-language models (VLMs) have achieved promising success in many fields, especially with prompt learning paradigm. In this work, we propose GIP-COL (Graph-Injected Soft Prompting for COmpositional Learning) to better explore the …

CycleNet: Rethinking Cycle Consistent in Text‑Guided Diffusion for Image Manipulation

Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks but lack an intuitive interface for consistent image-to-image (I2I) translation. Various methods have been explored to address this issue, including mask-based methods, …