DexVLA (Vision-Language Model with Plug-In Diffusion Expert for Visuomotor Policy Learning)

Contributed by Midea Group

1. Install

To guarantee clean isolation between training and evaluation, we provide two distinct, self-contained setups. Both the training and the evaluation environment are shared by DexVLA and TinyVLA.

Training Environment:

cd policy/DexVLA
conda env create -f Train_Tiny_DexVLA_train.yml
conda activate dexvla-robo
cd policy_heads
pip install -e .

Evaluation Environment:

If you already have RoboTwin 2.0 installed, activate its conda environment and install the evaluation dependencies:

conda activate your_RoboTwin_env
pip install -r Eval_Tiny_DexVLA_requirements.txt 

2. Prepare Training Data

This step performs data preprocessing, converting the original RoboTwin 2.0 data into the format required for DexVLA training. The expert_data_num parameter specifies the number of expert trajectories to use as training data.

python process_data.py ${task_name} ${task_config} ${expert_data_num}
# python process_data.py beat_block_hammer demo_randomized 50

If successful, you will find the processed data in the policy/DexVLA/data/sim_${task_name}/${setting}_${expert_data_num} folder.

3. Train Policy

This step launches the training process.

3.1 Download official Qwen2-VL weights

We construct the VLM backbone from Qwen2-VL-2B. You can download the official weights from the link below:

| Model | Link |
| --- | --- |
| Qwen2-VL (~2B) | huggingface |
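
If you prefer the command line, the weights can also be fetched with huggingface-cli. This is a minimal sketch: it assumes the intended checkpoint is the Qwen/Qwen2-VL-2B-Instruct repository and downloads it to ./weights; adjust both to your setup.

# Hypothetical example: download the Qwen2-VL-2B weights with huggingface-cli
huggingface-cli download Qwen/Qwen2-VL-2B-Instruct --local-dir ./weights/Qwen2-VL-2B-Instruct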

❗❗ After downloading the standard weights, you have to modify the official config.json file in the downloaded folder. Please update the 'architectures' field from "Qwen2VLForConditionalGeneration" to "DexVLA", and change the 'model_type' field from "qwen2_vl" to "dex_vla".
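
After the edit, the two affected fields in config.json should read as follows; all other fields stay unchanged.

"architectures": [
  "DexVLA"
],
"model_type": "dex_vla",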

3.2 Download our pretrained ScaleDP-H weights

We have released the pretrained ScaleDP-H weights obtained after Stage 1. You can download them and directly fine-tune on your own data in Stage 2.

| Model | Link |
| --- | --- |
| ScaleDP-H (~1B) | huggingface |
| ScaleDP-L (~400M) | huggingface |

3.3 Train

The training script is scripts/aloha/vla_stage2_train.sh. You need to modify the following parameters (an example excerpt is shown below):
1. OUTPUT: the save directory for training. Its name must include the keyword "qwen2" (and optionally "lora"); if LoRA training is used, the name must include "lora" (e.g., "qwen2_lora").
2. TASKNAME: the task used for training, which should correspond to "your_task_name" in aloha_scripts/constant.py.
3. mnop: path to the pretrained VLM weights.
4. load_pretrain_dit: set to True to load the pretrained policy head.
5. DIT_PRETRAIN: path to the pretrained policy head (ScaleDP) weights.

Other hyperparameters such as batch_size and save_steps can be adjusted according to your compute resources.
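
For reference, the edited script might look like the sketch below. The exact variable layout in the released vla_stage2_train.sh may differ, and every path here is a placeholder for your own setup.

# Hypothetical excerpt of scripts/aloha/vla_stage2_train.sh after editing
OUTPUT=./output/qwen2_lora_beat_block_hammer   # must contain "qwen2"; include "lora" when LoRA training is used
TASKNAME=your_task_name                        # must match an entry in aloha_scripts/constant.py
mnop=/path/to/Qwen2-VL-2B                      # pretrained VLM weights from Step 3.1
DIT_PRETRAIN=/path/to/ScaleDP-H                # pretrained ScaleDP policy head weights from Step 3.2
load_pretrain_dit=True                         # load the pretrained ScaleDP weights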

Start training with the following command:

bash ./scripts/aloha/vla_stage2_train.sh

4. Eval Policy

You need to modify the corresponding paths in the deploy_policy.yml file (see the example below):

1. model_path: path to the trained model, inside the OUTPUT directory.
2. state_path: path to dataset_stats.pkl, inside the OUTPUT directory.
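
A minimal sketch of these two entries, assuming the training output was written to ./output/qwen2_lora_beat_block_hammer; both paths are placeholders, and the remaining fields of deploy_policy.yml stay as shipped.

# Hypothetical excerpt of deploy_policy.yml
model_path: ./output/qwen2_lora_beat_block_hammer/checkpoint-10000   # a checkpoint folder under OUTPUT (placeholder)
state_path: ./output/qwen2_lora_beat_block_hammer/dataset_stats.pkl  # dataset_stats.pkl produced under OUTPUT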

Then execute:

bash eval.sh ${task_name} ${task_config} ${ckpt_setting} ${expert_data_num} ${seed} ${gpu_id}
# bash eval.sh beat_block_hammer demo_randomized 0 50 0 0

5. Citation

If you find our work useful for your research and applications, please cite it using the following BibTeX entries:

5.1 DexVLA

@article{wen2025dexvla,
  title={DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control},
  author={Wen, Junjie and Zhu, Yichen and Li, Jinming and Tang, Zhibin and Shen, Chaomin and Feng, Feifei},
  journal={arXiv preprint arXiv:2502.05855},
  year={2025}
}

5.2 DiffusionVLA

@article{wen2024diffusion,
  title={Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression},
  author={Wen, Junjie and Zhu, Minjie and Zhu, Yichen and Tang, Zhibin and Li, Jinming and Zhou, Zhongyi and Li, Chengmeng and Liu, Xiaoyu and Peng, Yaxin and Shen, Chaomin and others},
  journal={arXiv preprint arXiv:2412.03293},
  year={2024}
}

5.3 ScaleDP

@article{zhu2024scaling,
  title={Scaling diffusion policy in transformer to 1 billion parameters for robotic manipulation},
  author={Zhu, Minjie and Zhu, Yichen and Li, Jinming and Wen, Junjie and Xu, Zhiyuan and Liu, Ning and Cheng, Ran and Shen, Chaomin and Peng, Yaxin and Feng, Feifei and others},
  journal={arXiv preprint arXiv:2409.14411},
  year={2024}
}