
LLaVA-VLA

Contributed by IRPN Lab, HKUST(GZ)
Email: songwenxuan0115@gmail.com, sunxiaoquan@hust.edu.cn

1. Environment Setup

See LLaVA-VLA installation for more details.

2. Download Model

Please download the corresponding model from the model zoo.

3. Collect RoboTwin Data

See RoboTwin Tutorial (Usage Section) for more details.

4. Generate Images and Data

First, create the pictures and training_data folders in the policy/LLaVA-VLA directory:

mkdir pictures training_data
cd scripts/helper
Then, extract the original images from the RoboTwin data.
bash image_extraction.sh ${task_name} ${task_config}
# bash image_extraction.sh grab_roller demo_randomized
# bash image_extraction.sh all demo_randomized
# For task_name, you can select a single task (e.g. grab_roller) or choose "all" (the task set is defined in task_list; modify it there as needed).
Next, generate the formatted data required for LLaVA-VLA training.
bash process_data.sh ${task_name} ${task_config} ${future_chunk}
# bash process_data.sh grab_roller demo_randomized 5
# bash process_data.sh all demo_randomized 5
# For task_name, you can select a single task (e.g. grab_roller) or choose "all" (the task set is defined in task_list; modify it there as needed).
# future_chunk: the number of future action steps to output per prediction (default: 5).
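For intuition, future_chunk groups the next few action steps into a single prediction target. The following sketch only illustrates that idea; the function name and end-of-episode padding are our assumptions, not the actual process_data implementation:

def chunk_future_actions(actions, future_chunk=5):
    """actions: list of per-timestep action vectors from one episode (illustrative)."""
    targets = []
    for t in range(len(actions)):
        chunk = list(actions[t:t + future_chunk])
        # Pad with the last action when the episode is about to end.
        while len(chunk) < future_chunk:
            chunk.append(actions[-1])
        targets.append(chunk)
    return targets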
Example folder structure:
training_data
├── ${task_1}
│   ├── ${task_config_1}
│   │   ├── episode0.json
│   │   ├── episode1.json
│   ├── ${task_config_2}
│   │   ├── episode0.json
│   │   ├── episode1.json
├── ${task_2}
│   ├── ...
├── ...
pictures
├── ${task_1}
│   ├── ${task_config_1}
│   │   ├── episode0
│   │   │   ├── 01.jpg
│   │   │   ├── 02.jpg
│   ├── ${task_config_2}
│   │   ├── episode0
│   │   │   ├── 01.jpg
│   │   │   ├── ...
├── ${task_2}
│   ├── ...
├── ...
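
Optionally, you can sanity-check that every generated episode JSON has a matching image folder. The small script below is only an illustration (it assumes you run it from policy/LLaVA-VLA and is not part of the official tooling):

import os

# Illustrative check: each training_data/<task>/<config>/episodeN.json
# should have a matching pictures/<task>/<config>/episodeN folder.
for task in sorted(os.listdir("training_data")):
    for config in sorted(os.listdir(os.path.join("training_data", task))):
        json_dir = os.path.join("training_data", task, config)
        img_dir = os.path.join("pictures", task, config)
        episodes = [f[:-5] for f in os.listdir(json_dir) if f.endswith(".json")]
        missing = [e for e in episodes if not os.path.isdir(os.path.join(img_dir, e))]
        if missing:
            print(f"{task}/{config}: missing image folders for {missing}")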

5. Merge JSON and Generate YAML File

In this step, merge all the JSON files generated by the previous process_data step into a single JSON file, then generate the corresponding YAML file.

python llava/process_data/merge_json.py
# please replace `yourpath` with your actual path!
python llava/process_data/yaml_general.py
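
Conceptually, the merge step collects every per-episode JSON under training_data into one list and writes it to a single file. The sketch below only illustrates that idea (the output path and the assumption that each episode file stores a list of samples are ours); use the provided merge_json.py in practice:

import glob
import json

merged = []
# Gather every episode file produced by process_data.sh.
for path in sorted(glob.glob("training_data/*/*/episode*.json")):
    with open(path) as f:
        data = json.load(f)
    # Assumption: each episode file stores a list of training samples.
    merged.extend(data if isinstance(data, list) else [data])

# Illustrative output path; merge_json.py defines the real one.
with open("training_data/merged.json", "w") as f:
    json.dump(merged, f)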

6. Pre-Training

Before starting training, please replace yourpath with your actual path!

bash calvin_finetune_obs.sh

7. Fine-tuning

Please change MODEL_NAME_OR_PATH to the checkpoint generated in the previous step. For the dataset you fine-tune on, regenerate the ACTION_STAT file and update JSON_PATH accordingly. Then run:

bash calvin_finetune_obs.sh

8. Eval on RoboTwin

You need to modify the corresponding paths in the deploy_policy.yml file:

1. model_path: path to the checkpoint.
2. action_stat: path to dataset_statistic.yaml.
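
For reference, the relevant entries might look like the following excerpt (placeholder paths; only these two keys need to be updated):

# deploy_policy.yml (excerpt, placeholder paths)
model_path: /path/to/your/llava_vla_checkpoint
action_stat: /path/to/dataset_statistic.yaml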

bash eval.sh ${gpu_id}
# bash eval.sh 0
The evaluation results, including videos, will be saved in the eval_result directory under the project root.

9. Citation

If you find our work useful for your research and applications, please cite it using this BibTeX entry:

@article{pdvla,
  title={Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding},
  author={Song, Wenxuan and Chen, Jiayi and Ding, Pengxiang and Zhao, Han and Zhao, Wei and Zhong, Zhide and Ge, Zongyuan and Ma, Jun and Li, Haoang},
  journal={arXiv preprint arXiv:2503.02310},
  year={2025}
}