Collect Data

We provide over 100,000 pre-collected trajectories as part of the open-source RoboTwin Dataset release. However, because task and embodiment setups are highly configurable and diverse, we strongly recommend that users collect data themselves.

Running the following command will first search for random seeds until the target number of successful episodes is reached, and then replay those seeds to collect the data.

bash collect_data.sh ${task_name} ${task_config} ${gpu_id}
# Example: bash collect_data.sh beat_block_hammer demo_randomized 0

After data collection completes, the collected data is stored under data/${task_name}/${task_config}; an illustrative listing of this layout follows the list below.

  • Each trajectory's observation and action data are saved in HDF5 format in the data directory.
  • The corresponding language instructions for each trajectory are stored in the instructions directory.
  • Head camera videos of each trajectory can be found in the video directory.
  • The _traj_data, .cache, scene_info.json, and seed.txt files are auxiliary outputs generated during the data collection process.
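
For orientation, running the example command above might leave a layout like the following under data/beat_block_hammer/demo_randomized. This is a hedged sketch: the directory names follow the list above, but the exact contents and HDF5 file names can vary by task and configuration.

ls data/beat_block_hammer/demo_randomized
# data/  instructions/  video/  _traj_data/  .cache/  scene_info.json  seed.txt

# Optional: inspect one collected episode with the standard HDF5 CLI tools
# (the glob below is a placeholder, not a guaranteed file name).
h5ls -r data/beat_block_hammer/demo_randomized/data/*.hdf5 | head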

All available task_name options are listed in the task documentation (see the link below). The gpu_id parameter selects which GPU to use; set it to an integer from 0 to N-1, where N is the number of GPUs available on your system.
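
If you are unsure how many GPUs are available, a quick check before launching collection might look like the following (this assumes nvidia-smi is installed, as on a typical CUDA machine):

nvidia-smi --list-gpus   # prints one line per GPU, indexed 0 to N-1
bash collect_data.sh beat_block_hammer demo_randomized 1   # e.g. collect on the second GPU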

Our data synthesizer enables automated data collection by executing the task scripts in the envs directory together with the cuRobo motion planner. Data collection is configured through a task-specific configuration file (see the tutorial in ./configurations.md), which defines parameters such as the target embodiment, domain randomization settings, and the number of data samples to collect.
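
As a hedged sketch of that workflow, you might copy an existing configuration, adjust its fields, and then collect under the new name. The task_config/ path and .yml file names below are assumptions for illustration; consult ./configurations.md for the actual layout and keys.

# Hypothetical workflow; paths and file names are assumptions (see ./configurations.md).
cp task_config/demo_randomized.yml task_config/my_config.yml
# Edit my_config.yml: target embodiment, domain randomization settings, number of data samples.
bash collect_data.sh beat_block_hammer my_config 0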

The success rate of data generation for each embodiment across all tasks can be found at: https://robotwin-platform.github.io/doc/tasks/index.html. Due to the structural limitations of different robotic arms, not all embodiments are capable of completing every task.

Our pipeline first explores a set of random seeds (recorded in seed.txt) to identify trajectories that yield successful data collection, and then records the corresponding fine-grained action trajectories in _traj_data. The resulting head-camera videos are saved in the video directory, as noted above.
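
To see what a run actually found, you can inspect these auxiliary outputs directly; the paths below mirror the example layout sketched earlier and are illustrative only.

cat data/beat_block_hammer/demo_randomized/seed.txt    # seeds that produced successful episodes
ls  data/beat_block_hammer/demo_randomized/_traj_data  # recorded fine-grained trajectories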

The entire process is fully automated: just run the single command above to get started.

⚠️ If you see a warning about a missing pytorch3d installation, it can safely be ignored as long as you do not need 3D data.