NatSGD: A Dataset with Speech, Gestures, and Demonstrations for Robot Learning in Natural Human-Robot Interaction

Snehesh Shrestha, Yantian Zha, Ge Gao, Cornelia Fermüller, Yiannis Aloimonos

Perception and Robotics Group

University of Maryland, College Park







Recent multimodal Human-Robot Interaction (HRI) datasets feature both speech and gesture, giving robots more opportunities to learn explicit and tacit HRI knowledge. However, these speech-gesture HRI datasets focus on simple tasks such as object pointing and pushing, which do not scale easily to complex domains. They also emphasize collecting human command data over the corresponding robot behavior (response) data. In this work, we introduce NatSGD, a multimodal HRI dataset that contains human commands given as speech and gestures, along with robot behavior in the form of synchronized demonstrated robot trajectories. Our data enable HRI with Imitation Learning so that robots can learn to work with humans in challenging, real-life domains such as performing complex tasks in the kitchen. To demonstrate the utility of our dataset, we propose benchmark tasks that enable smooth human-robot interaction, including 1) language-and-gesture-based instruction following and 2) task understanding (Linear Temporal Logic formulae prediction). The dataset and code are available at


We would like to thank our peers and the faculty of the UMD CS department and the Perception and Robotics Group for their valuable feedback and discussions. Special thanks to Lindsay Little, Dr. Vibha Sazawal, Dr. Michelle Mazurek, Nirat Saini, Virginia Choi, Dr. Chethan Parameshwara, and Dr. Nitin Sanket. We also want to recognize the contributions of Aavash Thapa, Sushant Tamrakar, Jordan Woo, Noah Bathras, Zaryab Bhatti, Youming Zhang, Jiawei Shi, Zhuoni Jie, Tianpei Gu, Nathaneal Brain, Jiejun Zhang, Daniel Arthur, Shaurya Srivastava, and Steve Clausen. Without their contributions and support, this work would not have been possible. Finally, the support of NSF under grant OISE 2020624 is gratefully acknowledged.


  title     = {{NatSGD}: A Dataset with {S}peech, {G}estures, and Demonstrations for Robot Learning in {Nat}ural Human-Robot Interaction},
  author    = {Snehesh Shrestha and Yantian Zha and Ge Gao and Cornelia Fermüller and Yiannis Aloimonos},
  year      = {2022},


NatSGD is freely available for non-commercial and research use and may be redistributed under the conditions detailed on the license page. For commercial licensing or if you have any questions, please get in touch with me at