SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

Abstract

SafetyALFRED is a benchmark that extends ALFRED with six kitchen hazard categories. We evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on hazard recognition and on risk mitigation through embodied planning. Our findings reveal a significant alignment gap between models' ability to identify hazards via question answering and their capacity to actively mitigate risks in embodied contexts. We argue that static QA-based evaluations are insufficient for physical safety assessment and advocate for benchmarks that emphasize corrective action in embodied environments.

Publication
Findings of the Association for Computational Linguistics (ACL)
Josue Torres-Fonseca
Ph.D. Student, NSF GRFP Fellow
Yinpei Dai
Ph.D. Candidate
Shane Storks
Ph.D. Candidate
Yichi Zhang
Ph.D. Candidate