OHO: A Multi-Modal, Multi-Purpose Dataset for Human-Robot Object Hand-Over


Purpose and content

The dataset aims to provide a basis for training machine learning models on various problems in the context of robotic grasping of objects presented by a human:

  • Image-based hand/object segmentation
  • Point-cloud-based hand/object segmentation
  • 3D object reconstruction from RGB-D data or point clouds
  • 6D pose estimation of objects from images, depth data, or point clouds

The dataset therefore consists of several data fields captured synchronously, as well as automatically generated label information.
The database has two parts. The first is a set of background recordings, consisting of point clouds as well as RGB, thermal, and depth images, which are used for augmentation of the hand/object scenes.
The second is the hand/object data. Here, a green screen is used for recording objects together with the holding hand, which allows for automatic segmentation and augmentation with the backgrounds.


A) Backgrounds

There are separate folders for the individual samples, each containing the following files:

  • RGB and depth image of an Astra Orbbec S camera,
  • RGB and registered depth image of an Azure Kinect,
  • Thermal image of a TE-Q1 thermal camera,
  • Intrinsics for all images,
  • a transforms.xml containing the extrinsic camera parameters (poses) of all images, all with respect to a reference frame,
  • timestamps of all images,
  • a properties.xml file containing the associated split as well as some meta data, e.g. whether the image shows people in the background.
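As a minimal sketch, such a properties.xml file can be read with Python's standard library. The <split> tag is part of the dataset's meta data; the second tag name below is a hypothetical example, not the dataset's actual schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical properties.xml content; apart from <split>, the tag names
# are illustrative guesses, not the dataset's documented schema.
sample_xml = """<properties>
  <split>train</split>
  <peopleInBackground>false</peopleInBackground>
</properties>"""

def read_properties(xml_text):
    """Return all top-level properties fields as a dict of tag -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

props = read_properties(sample_xml)
print(props["split"])  # prints "train"
```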

B) Hand/Objects

For each object there is a folder containing many samples of the object in several positions.
For each position there is a reference image of the object without the human hand, followed by several images showing a human hand holding the object, which remains in its fixed position.

For each sample the following data fields have been recorded:

  • RGB and depth image of an Astra Orbbec S camera (this camera suffers from a thermal drift in the depth data, which has been compensated manually; therefore, the images are not the raw captured images but are scaled pixel-wise to match the point cloud position of the Azure Kinect),
  • RGB and registered depth image of an Azure Kinect,
  • Thermal image of a TE-Q1 thermal camera,
  • Intrinsics for all images,
  • a transforms.xml containing the extrinsic camera parameters (poses) of all images, as well as the pose of the object (ObjectFrame) and of the tripod (BoardFrame), all with respect to a reference frame,
  • mesh of the object model,
  • meta data in a properties.xml file containing the link to the reference image, several flags concerning the hand and object, and properties of the automatic labeling,
  • point clouds of hand and object with RGB and thermal data (in the alpha channel) taken from the Astra Orbbec S images (labeling based on point distance to the registered object mesh),
  • timestamps of all images
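Since each sample ships intrinsics alongside registered depth images, a point cloud can be reconstructed by back-projecting the depth image through the pinhole camera model. A minimal NumPy sketch with made-up intrinsic values (not the dataset's actual calibration):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a registered depth image (in metres) to 3D points
    in the camera frame, using the pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # shape (h, w, 3)

# Toy example: a flat 2x2 depth image at 1 m with made-up intrinsics.
pts = depth_to_points(np.full((2, 2), 1.0), fx=500.0, fy=500.0, cx=0.5, cy=0.5)
```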

Recording procedure

The cameras have been calibrated using the Azure Kinect as a reference (this device provides its own intrinsic and extrinsic parameters).
The intrinsics of the Astra Orbbec S cameras and of the thermal camera have been estimated using a checkerboard.
Unfortunately, the resulting focal lengths are not exact, so the resulting point clouds do not match.
A manual adjustment has been performed to make the point clouds match each other visually.
The extrinsic positions of the cameras have been measured, and the rotations were adjusted manually to bring the point clouds into alignment.
A remaining problem is the Astra Orbbec camera, which has a known bug: its depth data drifts over time due to thermal effects in the device.
As a workaround, the depth data is corrected manually by multiplying the depth values with a factor that varies with the x position in the image.
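This workaround amounts to a per-column rescaling of the depth image. A sketch with NumPy, assuming a factor that varies linearly over x (the per-recording factors themselves are not published, and the shipped Astra depth images are already corrected):

```python
import numpy as np

def correct_depth_drift(depth, factor_left, factor_right):
    """Rescale each depth pixel by a factor that varies linearly with the
    x position in the image. factor_left/factor_right are hypothetical
    calibration values; the dataset already ships corrected images."""
    _, w = depth.shape
    factors = np.linspace(factor_left, factor_right, w)  # one factor per column
    return depth * factors[np.newaxis, :]

corrected = correct_depth_drift(np.ones((2, 4)), 1.0, 1.3)
```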

The setup consists of a green screen background, a mounting post (also covered by a green screen), and the cameras.
Furthermore, there is a replaceable ArUco marker board that defines the reference position of the mounting pole.
That marker board is used to define the 3D position of a region of interest, from which a cube is cut out of the point cloud.
The mounting pole is also removed from the point cloud by means of that board position.
After the mounting pole position is defined, the 3D mesh model is fitted into the remaining point cloud using the ICP algorithm.
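The core of such a mesh fit can be illustrated with a single ICP iteration: match each point to its nearest model point, then solve for the best-fit rigid transform via the Kabsch/SVD method. This is only a minimal sketch of the general algorithm, not the dataset's actual fitting tool:

```python
import numpy as np

def icp_step(src, dst):
    """One ICP iteration: match each source point to its nearest point in dst,
    then solve for the best-fit rotation R and translation t (Kabsch / SVD)
    so that R @ src + t approximates the matched points."""
    # Brute-force nearest-neighbour correspondences.
    dists = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    matched = dst[dists.argmin(axis=1)]
    # Kabsch algorithm on the centred point sets.
    mu_s, mu_m = src.mean(axis=0), matched.mean(axis=0)
    H = (src - mu_s).T @ (matched - mu_m)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_m - R @ mu_s
    return R, t

# Toy check: a small pure translation should be recovered exactly.
src = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
R, t = icp_step(src, src + np.array([0.01, 0.0, 0.0]))
```

In practice the step is repeated, applying the recovered transform to the source cloud each time, until the alignment error stops decreasing.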
Then all markers are removed and the green screen is placed to cover the background.
Now a reference sample is taken containing the object only.
Afterwards, the holding hand can reach into the scene, and the point cloud is segmented automatically into hand and object based on the point distance to the object mesh.
The parameters of this labeling procedure are contained in the properties file.
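The distance-based labeling can be sketched as a simple thresholding step. This brute-force version measures distance to mesh vertices rather than to the mesh surface, and the function name and threshold are illustrative, not the dataset's actual tooling:

```python
import numpy as np

def segment_hand_object(points, mesh_vertices, threshold):
    """Label each point as object (True) if its nearest mesh vertex lies
    within `threshold`, otherwise as hand. Brute-force sketch; the actual
    pipeline measures the distance to the registered mesh surface, with
    its parameters stored in each sample's properties.xml."""
    d = np.linalg.norm(points[:, None, :] - mesh_vertices[None, :, :], axis=-1)
    return d.min(axis=1) <= threshold

pts = np.array([[0.0, 0.0, 0.0], [0.3, 0.0, 0.0]])
mesh = np.array([[0.0, 0.0, 0.01]])
labels = segment_hand_object(pts, mesh, threshold=0.05)  # [True, False]
```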
The point cloud of the Azure Kinect is not suitable for this labeling because its blurry depth image leads to false 3D points at object borders that cannot be distinguished from the hand or background.
Therefore, we use the Astra Orbbec S point cloud for labeling, while the ArUco marker board is detected in the Azure Kinect full-HD image.
The RGB image of the Astra Orbbec has too low a resolution for robust marker detection.

Split of the dataset

The meta data for each sample contains a <split> tag that assigns the sample to one of four subsets: train, validation, test, and testUnseenObject.
The split has been made such that no two subsets contain the same object in the same pose.
Additionally, the testUnseenObject subset contains object instances that appear in neither the training nor the validation data.
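Given the per-sample <split> tags, assembling the four subsets is a simple grouping step. A sketch with hypothetical sample identifiers standing in for the dataset's folder names:

```python
from collections import defaultdict

# Hypothetical sample-id -> split mapping, as read from each properties.xml.
splits = {
    "obj01/pos01": "train",
    "obj01/pos02": "validation",
    "obj02/pos01": "test",
    "obj03/pos01": "testUnseenObject",
}

def partition(splits):
    """Group sample ids by their split label."""
    subsets = defaultdict(list)
    for sample_id, split in splits.items():
        subsets[split].append(sample_id)
    return dict(subsets)

subsets = partition(splits)
```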


If you consider using the data sets on this page, please reference the following:

Stephan, B.; Köhler, M.; Müller, S.; Zhang, Y.; Gross, H.-M.; Notni, G.
OHO: A Multi-Modal, Multi-Purpose Dataset for Human-Robot Object Hand-Over.
In: Sensors 2023, 23(18), 7807. https://doi.org/10.3390/s23187807


Data set website: www.tu-ilmenau.de/neurob/data-sets-code/oho-dataset
To get access via FTP, please send the completed form by email to nikr-datasets-request@tu-ilmenau.de (for research purposes only).