NatSGD: A Dataset with Speech, Gestures, and Demonstrations for Robot Learning in Natural Human-Robot Interaction

Snehesh Shrestha, Yantian Zha, Ge Gao, Cornelia Fermüller, Yiannis Aloimonos

Perception and Robotics Group

University of Maryland College Park







Recent advances in multimodal Human-Robot Interaction (HRI) datasets feature speech and gestures, giving robots more opportunities to learn both explicit and tacit HRI knowledge. However, existing speech-gesture HRI datasets focus on simple tasks such as object pointing and pushing, which do not scale easily to complex domains, and they emphasize collecting human command data while providing little corresponding robot behavior (response) data. In this work, we introduce NatSGD, a multimodal HRI dataset that contains human commands as speech and gestures, together with robot behavior in the form of synchronized demonstrated robot trajectories. Our data enable HRI with Imitation Learning, so that robots can learn to work with humans in challenging real-life domains such as performing complex tasks in the kitchen. To demonstrate the utility of our dataset, we propose two benchmark tasks that enable smooth human-robot interaction: 1) language-and-gesture-based instruction following and 2) task understanding (Linear Temporal Logic formula prediction). The dataset and code are available at
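To make the task-understanding benchmark concrete, a kitchen instruction such as "cover the pot, then place the onion on the chopping board" can be expressed as a Linear Temporal Logic formula. The formula below is an illustrative sketch; the predicate names are hypothetical and not the dataset's actual atomic propositions:

```latex
% Hypothetical LTL formula: eventually the pot is covered, and after
% that, eventually the onion is on the chopping board.
% F is the "eventually" temporal operator.
\varphi \;=\; \mathbf{F}\,\big(\mathit{covered}(\mathit{pot}) \;\wedge\; \mathbf{F}\,\mathit{on}(\mathit{onion},\, \mathit{board})\big)
```

Predicting such a formula from the speech and gesture stream gives the robot a temporally ordered specification of the task rather than a single goal state.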

This video is a short sample generated by playing back a demonstration trajectory from the NatSGD dataset. In this clip, the robot follows human instructions to cook a vegetable soup: it covers a pot of water being heated and fetches an onion and a carrot. For each activity, the robot is guided by human speech, displayed as closed-caption text at the bottom of the screen. We also overlay gestures as images with skeleton keypoints in the middle of the video.

The video is a mosaic of 8 different camera views and modalities. The top row shows camera perspectives from different angles. From left to right, the first is a non-stationary camera view, which is what the human participant saw during the interaction; the other three are fixed cameras with third-person views. The bottom row shows the robot's first-person view from the camera mounted on its head. From left to right, the first is an RGB image; the second is a depth image; the third is an instance segmentation, i.e., a distinct color is assigned to each individual object; and the last is a semantic segmentation, where each color represents a single category such as appliances, utensils, and food. The depth view conveys the shape of each object and how far away it is. The instance segmentation helps the robot understand the boundary of each object as well as how objects interact with each other. Finally, the semantic segmentation helps the robot understand the common properties of objects within each category, which can help it generalize beyond the dataset.
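The relationship between the instance and semantic views can be sketched as follows. The instance ids and category names below are illustrative placeholders, not the dataset's actual label scheme:

```python
import numpy as np

# Hypothetical instance-id-to-category mapping; the real NatSGD
# label ids are not reproduced here.
INSTANCE_TO_CATEGORY = {
    1: "appliance",   # e.g. the stove
    2: "utensil",     # e.g. the pot
    3: "food",        # e.g. an onion
    4: "food",        # e.g. a carrot
}

def semantic_from_instance(instance_mask: np.ndarray) -> np.ndarray:
    """Collapse a per-instance segmentation mask into a per-category
    (semantic) mask: every pixel keeps its object's category label."""
    semantic = np.empty(instance_mask.shape, dtype=object)
    for inst_id, category in INSTANCE_TO_CATEGORY.items():
        semantic[instance_mask == inst_id] = category
    return semantic

# A toy 2x3 instance mask: two food instances and one utensil.
instances = np.array([[3, 3, 2],
                      [4, 4, 2]])
print(semantic_from_instance(instances))
```

The two views carry complementary information: the instance mask separates the two food objects, while the semantic mask merges them into one category, which is what supports generalization across objects of the same kind.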

Below we use three examples to show why having both speech and gestures is important.

Example 1:
From time 24 sec to 26 sec, the participant instructs the robot to place the onion on the chopping board. The participant points at the chopping board and says "now place it right there". The phrase "right there" refers to the specific place the participant is pointing at.

Example 2:
Similarly, from time 40 sec to 44 sec, the participant points to the countertop and instructs "between the potato, right there." While this phrase is syntactically incorrect and semantically incomplete, spoken instructions often follow this style, leaving the gesture or context implied. Here the instruction means between the potato and the apple, which the participant does not specify explicitly.

Example 3:
From time 18 sec to 22 sec, the participant augments a prior instruction with additional information to help a robot that seems unsure. Ambiguities remain, such as where the onion is or which of the three onions (placed next to each other) to pick up. In this case, the participant points at the object and describes its spatial relationship to the robot.
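One simple way such deictic references ("right there") can be resolved is by grounding the pointing gesture to the scene object nearest the pointing ray. The sketch below is a generic illustrative baseline under assumed inputs (a wrist position, a pointing direction, and known 3D object positions); it is not the benchmark model trained on NatSGD:

```python
import numpy as np

def ground_pointing(origin, direction, objects):
    """Resolve a deictic reference by choosing the object whose 3D
    position lies closest to the pointing ray. `objects` maps object
    names to 3D positions; all coordinates are in the same frame."""
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    o = np.asarray(origin, dtype=float)

    def ray_distance(p):
        v = np.asarray(p, dtype=float) - o
        t = max(v @ d, 0.0)  # clamp: objects behind the hand score poorly
        return np.linalg.norm(v - t * d)

    return min(objects, key=lambda name: ray_distance(objects[name]))

# Toy scene: wrist at the origin, pointing along +x toward the board.
scene = {"chopping_board": [1.0, 0.1, 0.0], "pot": [0.2, 1.0, 0.5]}
print(ground_pointing([0, 0, 0], [1, 0, 0], scene))
```

A learned model can of course do much better, e.g. by fusing the gesture ray with the speech ("place it right there") to restrict the candidate set to placement surfaces, but even this geometric baseline shows why the gesture channel is essential when the speech alone is underspecified.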


We would like to thank peers and faculty from the UMD CS department and the Perception and Robotics Group for their valuable feedback and discussions. Special thanks to Lindsay Little, Dr. Vibha Sazawal, Dr. Michelle Mazurek, Nirat Saini, Virginia Choi, Dr. Chethan Parameshwara, and Dr. Nitin Sanket. We also want to thank and recognize the contributions of Aavash Thapa, Sushant Tamrakar, Jordan Woo, Noah Bathras, Zaryab Bhatti, Youming Zhang, Jiawei Shi, Zhuoni Jie, Tianpei Gu, Nathaneal Brain, Jiejun Zhang, Daniel Arthur, Shaurya Srivastava, and Steve Clausen. Without their contributions and support, this work would not have been possible. Finally, the support of NSF under grant OISE 2020624 is gratefully acknowledged.


  title     = {{NatSGD}: A Dataset with {S}peech, {G}estures, and Demonstrations for Robot Learning in {Nat}ural Human-Robot Interaction},
  author    = {Snehesh Shrestha and Yantian Zha and Ge Gao and Cornelia Fermüller and Yiannis Aloimonos},
  year      = {2022},


NatSGD is freely available for non-commercial and research use and may be redistributed under the conditions detailed on the license page. For commercial licensing or if you have any questions, please get in touch with us at