ATTACH - Dataset



For a cobot to act autonomously and as an assistant, it must understand human actions during assembly. To effectively train models for this task, a dataset containing suitable assembly actions in a realistic setting is crucial. For this purpose, we provide the Annotated Two-Handed Assembly Actions for Human Action Understanding (ATTACH) dataset, which contains 51.6 hours of assembly recordings with 95.2k annotated fine-grained actions, monitored by three cameras that represent potential viewpoints of a cobot.


Aganian et al.: ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding. ICRA, 2023


Both the processed ATTACH dataset and the raw data are available for download.

To get access via FTP, please send the completed form by email to (for research purposes only).

Dataset as Described in our ICRA 2023 Paper

We provide the ATTACH dataset as described in our ICRA 2023 paper “ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding”.

It includes processed action labels with training, validation, and test split assignments for both the person and camera splits. Processed skeletons for MMAction and VACNN, as well as processed RGB images for MMAction, are also provided. All label information and skeletons are additionally available in a single pandas dataframe.

This processed dataset can be found in the folder “icra_processed_attach_dataset”. For more information, please read the readme.txt file in this folder.
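As an illustration of how such a combined dataframe can be queried, here is a minimal sketch using a tiny synthetic stand-in; the column names here are hypothetical assumptions, and the actual file name and layout are described in the readme.txt:

```python
import pandas as pd

# Tiny synthetic stand-in for the combined dataframe; these column names
# are hypothetical -- consult the readme.txt for the actual layout and
# how to load the shipped dataframe.
df = pd.DataFrame({
    "action_class": ["pick up board", "hold workpiece", "read manual"],
    "hand": ["left", "right", "both"],
    "start_frame": [10, 12, 200],
    "end_frame": [95, 180, 450],
    "person_split": ["train", "train", "test"],
})

# Select all annotations assigned to the training set of the person split.
train_df = df[df["person_split"] == "train"]
```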

Raw Data

We provide color, depth, and IR images, estimated 3D Kinect Azure skeletons, as well as extrinsic and intrinsic calibration files for all cameras together with detailed annotations for performed assembly actions.

Color, depth, and IR images are provided as videos for each of the 378 recordings (42 participants × 3 sets of instructions × 3 cameras). We also provide Python scripts to extract the respective image files (ffmpeg required).

All raw data can be found in the folder “raw_attach_dataset”. For more information, please read the readme.txt file in this folder.
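The kind of frame extraction the provided scripts perform can be sketched as a thin wrapper around ffmpeg; the file names and frame-naming pattern below are assumptions for illustration, not the layout used by the actual scripts:

```python
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(video: str, out_dir: str, fps: int = 30) -> list[str]:
    # Decode the video and write numbered PNG frames into out_dir.
    # The "frame_%06d.png" pattern is an assumption for this sketch.
    pattern = str(Path(out_dir) / "frame_%06d.png")
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps}", pattern]

def extract_frames(video: str, out_dir: str, fps: int = 30) -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(video, out_dir, fps), check=True)
```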



Aganian, D., Stephan, B., Eisenbach, M., Stretz, C., Gross, H.-M.
ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding
in: International Conference on Robotics and Automation (ICRA), London, UK, pp. 1-7, IEEE 2023

@inproceedings{aganian2023attach,
  title={ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding},
  author={Aganian, Dustin and Stephan, Benedict and Eisenbach, Markus and Stretz, Corinna and Gross, Horst-Michael},
  booktitle={International Conference on Robotics and Automation (ICRA)},
  pages={1--7},
  year={2023}
}

Dataset Description

For the ATTACH dataset, we asked 42 participants to assemble cabinets consisting of 26 parts, following three different sets of instructions. This yields 17.2 hours of recording time per view (51.6 hours in total) and an average duration of 8.2 minutes per recording (378 recordings in total). Each fine-grained action (e.g., picking up a board with the left hand, or attaching an object with a screwdriver held in the right hand) was annotated, resulting in a total of 95.2k annotations for 51 distinct action classes (on average 252 annotations per video).

Unique Features

During the recording of the ATTACH dataset, we focused on the following features:

1) Simultaneous fine-grained labels: During assembly, workers often perform different actions simultaneously with their left and right hands, e.g., picking something up with one hand while holding something with the other. In contrast to previous single-label assembly action datasets, which do not capture such behavior, we did not restrict participants from performing different actions with both hands and labeled all actions per hand for each time frame. As a result, more than 54% of all frames have more than one label describing the actions that occur, and more than 68% of annotations overlap with another annotation.

2) Diverse and dynamic assembly actions: In creating the dataset, we took special care to give participants as few instructions as possible; i.e., they were not given a script to follow, as is often the case in related work. Instead, they received only high-level written instructions, similar to those typically included with self-assembly furniture. Due to the variety of parts to be assembled, action durations also varied significantly, ranging from a fifth of a second for actions like lifting an object or rotating a workpiece to a minute or two for actions such as attaching an object with a wrench. Furthermore, each participant had a different level of craftsmanship, resulting in very large variance in the length and execution of the various actions.


The setup for data recording is shown in Fig. 1. A worktable is monitored by three Azure Kinect cameras, which capture the frontal, left, and right views of the worker assembling a piece of furniture, resembling typical observation positions of an assistive robot. Each camera is connected to a separate PC and records RGB images with a resolution of 2560×1440 pixels and depth/IR images with a resolution of 320×288 pixels at 30 FPS. Based on the known camera parameters, registered depth images with the same resolution as the RGB images can be computed. Extrinsic calibration is realized by a cube with ArUco markers placed at the center of the worktable before each recording; this position marks the origin of the global coordinate system. For all captured images, we recorded globally synced timestamps, enabling corresponding images to be matched across views if necessary. For each camera, the Azure Kinect body tracking SDK is employed, which uses the depth and IR images to extract a 3D skeleton of the worker.
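Given the extrinsic calibration, skeletons estimated in each camera's coordinate frame can be mapped into the shared global frame at the table center. A minimal pure-Python sketch, assuming the extrinsics are given as a rotation matrix R and translation t mapping camera coordinates to global coordinates (the exact convention is documented with the calibration files):

```python
def to_global(point, R, t):
    """Map one camera-frame 3D joint into the global (table) frame.

    R is a 3x3 rotation given as a list of rows and t a length-3
    translation; both are assumed here to map camera coordinates to
    global coordinates (see the dataset's calibration files for the
    actual convention).
    """
    return [sum(R[i][j] * point[j] for j in range(3)) + t[i]
            for i in range(3)]

# With identity extrinsics, a joint is unchanged.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
assert to_global([0.1, 0.2, 1.5], identity, [0, 0, 0]) == [0.1, 0.2, 1.5]
```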

Fig. 1. Setup of the ATTACH dataset (top left), exemplary recorded data, and annotations (bottom). The views of the ATTACH dataset are representative of possible monitoring perspectives of our exemplary cobot. The red line in the annotation diagram (bottom) marks the timestamp at which the above images were recorded. The annotations show which actions were performed before (e.g., plugging in a leg) and after (e.g., holding a wrench) with each hand. Several actions temporally overlap, which is the focus of the proposed ATTACH dataset.

Data and Annotations

Assembly task: In each recording, the piece of furniture to be assembled is an IKEA cabinet consisting of 26 parts. We created three versions of the assembly instructions, which differ in the order of the assembly steps and the actions to be performed, such as performing certain actions with bare hands or with a tool. Each recorded subject had to assemble the piece of furniture according to the construction manual. As our manuals consist only of goal-oriented instructions, like typical furniture assembly manuals, we did not specify how to achieve the next step. On average, participants took 8.2 minutes to assemble the piece of furniture. Overall, we recorded three complete assemblies for all 42 participants from three different viewpoints each, resulting in 378 recordings and 51.6 hours of recordings in total.

Participant statistics: We recorded 42 participants with different levels of assembly experience, of whom 31 were male and 11 were female. The participants' ages ranged from 21 to 67.

Annotations: The recorded data is annotated in detail, as shown in Fig. 2. The type of object on which a particular action is performed is distinguished, as this is necessary for follow-up tasks that identify specific assembly steps. We distinguish actions performed on five types of objects: “object” in Fig. 2 is a small object, such as a screw, that can be enclosed by one hand and of which multiple instances can easily be held in one hand. Actions performed on the walls of the cabinet or similar parts fall into the “board” category. Actions performed on partially assembled furniture are grouped in the “workpiece” category. When the subject uses a tool, the corresponding action is placed in the respective tool category, unless the tool is applied directly to a specific object or workpiece; in that case, we annotated it as attaching an object with a specific tool or pressing with a tool. The category “other” contains actions performed with the construction manual (e.g., reading, browsing).

Using this scheme, we obtain the 51 action classes shown in Fig. 2. For each category, we annotated several actions separately for both hands. This means that a subject can perform one action with the left hand while simultaneously performing another action with the right hand, e.g., as shown in Fig. 1. This results in more than one label for 54% of all frames, and more than 68% of annotations overlap with another annotation. Overall, the data are annotated with 95.2k annotations, corresponding to 252 annotations per assembly sequence on average. A histogram of the durations of the performed actions in our dataset can be found in Fig. 3.
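Frame-wise statistics like the share of multi-label frames can be computed by rasterizing the per-hand interval annotations into per-frame label sets. A minimal sketch with a toy annotation list (the actual annotation format is described in the readme.txt):

```python
from collections import defaultdict

# Toy per-hand interval annotations: (action, hand, start_frame, end_frame),
# with end_frame inclusive. The tuple layout is an assumption for this
# sketch; see the dataset's readme.txt for the real format.
annotations = [
    ("pick up board", "left", 0, 4),
    ("hold workpiece", "right", 2, 8),
]

labels_per_frame = defaultdict(set)
for action, hand, start, end in annotations:
    for frame in range(start, end + 1):
        labels_per_frame[frame].add((action, hand))

# Fraction of frames carrying more than one simultaneous label.
multi = sum(1 for labels in labels_per_frame.values() if len(labels) > 1)
share = multi / len(labels_per_frame)
```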

Dataset splits

For evaluations on our dataset, we use a person split and a view split, similar to other datasets such as TSU.

Person split: We split our participants into three groups: recordings from two-thirds of all participants (28) are used for training, while the remaining third is split into validation (4 participants) and test data (10 participants). Care was taken to ensure that all action classes appear with sufficient frequency in both the training and test splits.

View split: The camera views shown in Fig. 1 were split as follows: CamRight serves as test data, while recordings from CamFront and CamLeft are used for the training and validation splits. Since the views already perceive the scene very differently, we chose not to reserve another camera for validation; instead, 10% of all recordings from the front and left cameras were assigned to validation. As our experiments in the paper show, the view split is a major challenge because the scene looks vastly different from each point of view. Furthermore, in at least one of the views, the person and the action performed are always partially obscured by furniture parts. This split represents the situation of a mobile cobot viewing the scene from a perspective different from those available during training.
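The split logic can be sketched as follows. Only the group sizes (28/4/10) and the camera assignment are taken from the description above; the concrete participant-ID ranges are placeholders, not the shipped mapping:

```python
def person_split(participant_id: int) -> str:
    """Illustrative person-split assignment (28 train / 4 val / 10 test).

    Only the group sizes come from the dataset description; the ID
    ranges below are placeholders for this sketch.
    """
    if participant_id < 28:
        return "train"
    if participant_id < 32:
        return "val"
    return "test"

def view_split(camera: str, is_validation_recording: bool) -> str:
    """View split: CamRight is the test view; 10% of the front/left
    recordings (flagged here by the caller) form the validation set."""
    if camera == "CamRight":
        return "test"
    return "val" if is_validation_recording else "train"
```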

Benchmark Results

First benchmark results for action recognition on video clips and 3D-skeleton sequences can be found in our paper “ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding”.


This work has received funding from the Carl-Zeiss-Stiftung as part of the project "Engineering for Smart Manufacturing" (E4SM).