1

4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time

Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model …

Benchmarking and Improving LLM Robustness for Personalized Generation

Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user's preferences, we argue that factuality is an equally …

Proactive Assistant Dialogue Generation from Streaming Egocentric Videos

Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their …

Transparent and Coherent Procedural Mistake Detection

Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine …

VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation

Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor …

Bootstrapping Visual Assistant Modeling with Situated Interaction Simulation

Visual assistants that can guide humans through complex tasks in physical environments have significant potential, yet their development is hindered by the high cost of human-in-the-loop data collection. We present BASIS (Bootstrapping Assistant …

Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). …

AimBot: A Simple Auxiliary Visual Cue to Enhance Spatial Awareness of Visuomotor Policies

In this paper, we propose AimBot, a lightweight visual augmentation technique that provides explicit spatial cues to improve visuomotor policy learning in robotic manipulation. AimBot overlays shooting lines and scope reticles onto multi-view RGB …

Flexible Physical Problem Solving with Strategy Acquisition and Composition

Humans exhibit a remarkable ability to acquire, generalize, and compose strategies for object manipulation, yet the un- derlying mechanisms of this flexible strategy learning and reuse remain poorly understood. In this paper, we extend the Virtual …

Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation

Internal world models (WMs) enable agents to understand the world's state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit …