Multimodal Learning for Situated Language Understanding

Last updated on Feb 14, 2025

Motivations and Objectives

Using situated dialogue in a virtual world and conversational interfaces as our setting, we have investigated how non-verbal modalities (e.g., eye gaze and deictic gestures) can be used in language processing and conversation grounding. The virtual world setting not only has important applications in education, training, and entertainment, but also provides a simplified simulation environment that supports studies of situated language processing aimed at physical-world interaction.

Selected Recent Papers

Language & 3D Vision

Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, Matt Feiszli. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. Preprint, 2025.
Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai. 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination. Preprint, 2024.
Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai. LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent. ICRA, 2024.
Yichi Zhang, Jianing Yang, Jiayi Pan, Shane Storks, Nikhil Devraj, Ziqiao Ma, Keunwoo Peter Yu, Yuwei Bao, Joyce Chai. DANLI: Deliberative Agent for Following Natural Language Instructions. EMNLP, 2022.

Language & 2D Vision

Zheyuan Zhang, Fengyuan Hu, Jayjun Lee, Freda Shi, Parisa Kordjamshidi, Joyce Chai, Ziqiao Ma. Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities. ICLR, 2025. (Oral)
Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai. Multi-Object Hallucination in Vision-Language Models. NeurIPS, 2024.
Sihan Xu, Yidong Huang, Jiayi Pan, Ziqiao Ma, Joyce Chai. Inversion-Free Image Editing with Natural Language. CVPR, 2024.
Yichi Zhang, Ziqiao Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozi Gao, Joyce Chai. GROUNDHOG: Grounding Large Language Models to Holistic Segmentation. CVPR, 2024.
Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai. Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? EMNLP, 2023.
Sihan Xu, Ziqiao Ma, Yidong Huang, Honglak Lee, Joyce Chai. CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation. NeurIPS, 2023.
Ziqiao Ma, Jiayi Pan, Joyce Chai. World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models. ACL, 2023. (Outstanding Paper Award)

Language & Eye Gaze

Zahar Prasov and Joyce Chai. Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue. EMNLP, 2010.
Shaolin Qu and Joyce Chai. Incorporating Temporal and Semantic Information with Eye Gaze for Automatic Word Acquisition in Multimodal Conversational Systems. EMNLP, 2008.
Shaolin Qu and Joyce Chai. An Exploration of Eye Gaze in Spoken Language Processing for Multimodal Conversational Interfaces. NAACL, 2007.