SafetyALFRED is a benchmark that extends ALFRED with six categories of kitchen hazards. We evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on two tasks: recognizing hazards through question answering, and mitigating risks through embodied planning. Our findings reveal a significant alignment gap between the models' ability to identify hazards when asked about them and their capacity to actively mitigate the same risks in embodied contexts. We argue that static QA-based evaluations are insufficient for assessing physical safety and advocate for benchmarks that emphasize corrective action in embodied environments.