Snehesh Shrestha, Yantian Zha, Saketh Banagiri, Ge Gao, Yiannis Aloimonos, Cornelia Fermüller
Perception and Robotics Group
University of Maryland, College Park
Abstract
Recent advances in multimodal Human-Robot Interaction (HRI) datasets emphasize the integration of speech and gestures, allowing robots to absorb explicit knowledge and tacit understanding. However, existing datasets primarily focus on elementary tasks like object pointing and pushing, limiting their applicability to complex domains. They prioritize simpler human command data but place less emphasis on training robots to correctly interpret tasks and respond appropriately. To address these gaps, we present the NatSGLD dataset, which was collected using a Wizard of Oz (WoZ) method, where participants interacted with a robot they believed to be autonomous. NatSGLD records humans’ multimodal commands (speech and gestures), each paired with a demonstration trajectory and a Linear Temporal Logic (LTL) formula that provides a ground-truth interpretation of the commanded tasks. This dataset serves as a foundational resource for research at the intersection of HRI and machine learning. By providing multimodal inputs and detailed annotations, NatSGLD enables exploration in areas such as multimodal instruction following, plan recognition, and human-advisable reinforcement learning from demonstrations.
We release the dataset, simulator, and code under the MIT License on the NatSGLD website to support future HRI research.
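To make the record structure described above concrete, the following is a minimal, hypothetical sketch in Python of how a single NatSGLD-style episode (speech, gestures, demonstration trajectory, and LTL formula) might be represented. The field names, toy values, and example LTL formula are illustrative assumptions, not the released schema; please refer to the dataset website and code for the actual format.

# Hypothetical sketch of a NatSGLD-style episode record.
# Field names and values are illustrative assumptions, not the released schema.
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    speech: str                    # transcribed spoken command
    gesture_keypoints: List[list]  # per-frame body/hand keypoints
    demonstration: List[dict]      # demonstration trajectory (state per step)
    ltl_formula: str               # ground-truth LTL interpretation of the task

# Toy episode: the participant asks the robot to cut a tomato,
# then place it on a plate.
episode = Episode(
    speech="Could you cut the tomato and put it on the plate?",
    gesture_keypoints=[[(0.41, 0.22), (0.40, 0.25)]],            # toy values
    demonstration=[{"step": 0, "gripper": "open"},
                   {"step": 1, "gripper": "closed"}],
    ltl_formula="F(cut_tomato & F(tomato_on_plate))",            # "eventually cut, then eventually place"
)

print(episode.ltl_formula)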
Acknowledgments
We thank our peers, faculty, staff, and research assistants from the Perception and Robotics Group (PRG), the University of Maryland Computer Science department (UMD CS), and the University of Maryland Institute for Advanced Computer Studies (UMIACS) for their continued support and valuable discussions and feedback.
Special thanks to Dr. Vibha Sazawal, Dr. Michelle Mazurek, Vaishnavi Patil, Lindsay Little, Dr. Nirat Saini, Dr. Virginia Choi, Dr. Chethan Parameshwara, Dr. Nitin Sanket, and Dr. Chahat Deep Singh. We want to thank and recognize the contributions of Aavash Thapa, Sushant Tamrakar, Jordan Woo, Noah Bathras, Zaryab Bhatti, Youming Zhang, Jiawei Shi, Zhuoni Jie, Tianpei Gu, Nathaneal Brain, Jiejun Zhang, Daniel Arthur, Shaurya Srivastava, and Steve Clausen. Without their contributions and support, this work would not have been possible. Finally, a special thanks to Dr. Jeremy Marvel of the National Institute of Standards and Technology (NIST) for his support and advice.
The work was supported by NSF grant OISE 2020624.
BibTeX
@inproceedings{shrestha2025natsgld,
  title     = {{NatSGLD}: A Dataset with {S}peech, {G}estures, {L}ogic, and Demonstrations for Robot Learning in {Nat}ural Human-Robot Interaction},
  author    = {Snehesh Shrestha and Yantian Zha and Saketh Banagiri and Ge Gao and Yiannis Aloimonos and Cornelia Fermüller},
  year      = {2025},
  booktitle = {2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI)}
}
License
NatSGLD is freely available for non-commercial and research use and may be redistributed under the MIT License. For commercial licensing or any other questions, please contact Snehesh Shrestha at snehesh@umd.edu.