Official implementation for BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
- March 2026: We have released the pre-trained BitVLA model! The evaluation results in the table below have been updated to reflect the performance after pre-training. You can try our new pre-trained model out of the box.
- June 2025: BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
Open Source Plan
- ✅ Paper, pre-trained VLM, and evaluation code
- ✅ Fine-tuned VLA code and models
- ✅ Pre-trained VLA
- 🧭 Pre-training code
Checkpoints
| Models | Size | Memory Usage ↓ | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Avg. |
|---|---|---|---|---|---|---|---|
| Large Models | | | | | | | |
| OpenVLA | 7.5B | 15.1GB (10.79×) | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| CoT-VLA | 8.0B | 16.2GB (11.57×) | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| UniVLA | 8.5B | 17.0GB (12.14×) | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| UnifiedVLA | 8.5B | 17.0GB (12.14×) | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 |
| OpenVLA-OFT | 7.7B | 15.4GB (11.00×) | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| Small Models | | | | | | | |
| SpatialVLA | 4.2B | 8.5GB (6.07×) | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| NORA-Long | 3.8B | 7.5GB (5.36×) | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| 4D-VLA | 4.1B | 8.3GB (5.93×) | 88.9 | 95.2 | 90.9 | 79.1 | 88.6 |
| SmolVLA | 2.3B | 4.6GB (3.29×) | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 |
| GROOT-N1 | 2.2B | 4.4GB (3.14×) | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| π₀ | 3.5B | 7.0GB (5.00×) | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| BitVLA w/o pre-training | 3.0B | 1.4GB (1.00×) | 97.4 | 99.6 | 94.4 | 87.6 | 94.8 |
| BitVLA | 3.0B | 1.4GB (1.00×) | 96.6 | 99.0 | 95.4 | 92.8 | 96.0 |
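The multiplier in parentheses in the Memory Usage column is each model's footprint divided by BitVLA's 1.4GB. A minimal Python check of that arithmetic, using a few rows copied from the table above (the function name `relative_memory` is illustrative and not part of this repository):

```python
# Memory footprints in GB, copied from the results table.
MEMORY_GB = {
    "OpenVLA": 15.1,
    "OpenVLA-OFT": 15.4,
    "SmolVLA": 4.6,
    "BitVLA": 1.4,
}


def relative_memory(memory_gb, baseline="BitVLA"):
    """Return each model's footprint as a multiple of the baseline's footprint."""
    base = memory_gb[baseline]
    return {name: round(gb / base, 2) for name, gb in memory_gb.items()}


ratios = relative_memory(MEMORY_GB)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```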
| Model | Path |
|---|---|
| BitVLA - VL&VLA pre-trained | lxsy/bitvla-bf16 |
| BitVLA - VL pre-trained | hongyuw/bitvla-bitsiglipL-224px-bf16 |
| BitVLA finetuned on LIBERO-Spatial | hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 |
| BitVLA finetuned on LIBERO-Object | hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16 |
| BitVLA finetuned on LIBERO-Goal | hongyuw/ft-bitvla-bitsiglipL-224px-libero_goal-bf16 |
| BitVLA finetuned on LIBERO-Long | hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16 |
| BitVLA w/ BF16 SigLIP | hongyuw/bitvla-siglipL-224px-bf16 |
Note that we provide the master weights of BitVLA and perform online quantization. To obtain actual memory savings, quantize the weights offline to 1.58-bit precision. We recommend using the bitnet.cpp inference framework to accurately measure the reduction in inference cost. A dedicated inference framework and model are coming soon.
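To make "1.58-bit" concrete: each weight takes one of three values, and log2(3) is about 1.58 bits. The sketch below assumes the absmean ternary scheme described for BitNet b1.58 (one scale per tensor, round, then clip to {-1, 0, +1}). It is an illustration in plain Python, not this repository's quantizer, which may differ in granularity or rounding; the function name is hypothetical.

```python
import math


def quantize_ternary(weights, eps=1e-5):
    """Absmean quantization sketch: map each weight to {-1, 0, +1} plus one scale.

    Assumption: a single per-tensor scale equal to the mean absolute weight.
    """
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat)
    quantized = [
        [max(-1, min(1, round(w / (scale + eps)))) for w in row]
        for row in weights
    ]
    return quantized, scale


master = [[0.9, -0.05, -1.2], [0.3, 0.0, -0.6]]  # toy master weights
ternary, scale = quantize_ternary(master)
print(ternary)  # [[1, 0, -1], [1, 0, -1]]

# Dequantize by multiplying the ternary values by the stored scale.
approx = [[q * scale for q in row] for row in ternary]

# Three states per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)
```

With online quantization this mapping is applied on the fly to the full-precision master weights, which is why the released checkpoints do not shrink on disk until the ternary values are packed offline.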
Vision-Language
Evaluation on VQA
We use the LMM-Eval toolkit to evaluate on VQA tasks. This repository includes a modified copy of transformers in which modeling_llava.py and modeling_siglip.py are changed to support W1.58-A8 quantization.
Run the evaluation inside the nvcr.io/nvidia/pytorch:24.07-py3 Docker container (named nvidia_24_07 below). Start the container, clone the repository, and install the packages:
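The "A8" half of W1.58-A8 refers to 8-bit activations. As a rough sketch, assuming per-token absmax scaling to the signed 8-bit range (the scheme used in the BitNet line of work), activation quantization could look like the following. The function name is hypothetical and the modified modeling files may implement this differently.

```python
def quantize_activations_int8(tokens, eps=1e-5):
    """Per-token absmax quantization sketch: scale each row into [-127, 127]."""
    quantized, scales = [], []
    for row in tokens:
        absmax = max(max(abs(x) for x in row), eps)  # avoid division by zero
        scale = 127.0 / absmax
        quantized.append([max(-128, min(127, round(x * scale))) for x in row])
        scales.append(scale)
    return quantized, scales


activations = [[0.4, -1.0, 0.1], [3.0, 0.0, -4.0]]  # two toy token vectors
q_acts, q_scales = quantize_activations_int8(activations)
print(q_acts)  # [[51, -127, 13], [95, 0, -127]]
```

Dividing the integer values by the per-token scale recovers an approximation of the original activations.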
docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity # only use for multimodal evaluation
docker exec -it nvidia_24_07 bash
git clone https://github.com/ustcwhy/BitVLA.git
cd BitVLA/
bash vl_eval_setup.sh # only use for multimodal evaluation
First, download the BitVLA model from HuggingFace:
git clone https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16 # BitVLA w/ W1.58-A8 SigLIP-L
git clone https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16 # BitVLA w/ BF16 SigLIP-L
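If git with LFS support is not available, the same checkpoints can probably be fetched from Python instead. This is a sketch assuming the `huggingface_hub` package is installed (`pip install huggingface_hub`); the `checkpoints` directory and the helper names are illustrative choices, not something the repository prescribes.

```python
from pathlib import Path

# Repository IDs taken from the clone commands above.
REPOS = [
    "hongyuw/bitvla-bitsiglipL-224px-bf16",  # BitVLA w/ W1.58-A8 SigLIP-L
    "hongyuw/bitvla-siglipL-224px-bf16",     # BitVLA w/ BF16 SigLIP-L
]


def local_dir_for(repo_id, root="checkpoints"):
    """Mirror the folder name that `git clone` would create."""
    return Path(root) / repo_id.split("/")[-1]


def download_all(root="checkpoints"):
    """Download every checkpoint in REPOS (requires network access)."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id in REPOS:
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir_for(repo_id, root)))


# Call download_all() to fetch the models, then pass the resulting
# directories to the evaluation scripts.
```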
Then run the following scripts to conduct evaluations:
cd lmms-eval/
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-bitsiglipL-224px-bf16
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-siglipL-224px-bf16
Vision-Language-Action
Robotics Pre-training
To endow BitVLA with generalizable manipulation priors that transfer across embodiments and environments, we pre-train it with an autoregressive next-action prediction objective following OpenVLA.
Pre-training Details:
- Base model: We use hongyuw/bitvla-bitsiglipL-224px-bf16 as the base model.
- Dataset: Following OpenVLA, we use a curated large-scale corpus based on a subset of the Open X-Embodiment dataset, resulting in ~1M training samples.
- Hyperparameters: We train the model for 200K steps with a total batch size of 2048. The peak learning rates are set to 3×10⁻⁴ for the LLM and 1×10⁻⁴ for the ViT.
- Compute: The full pre-training takes approximately 14 days on 16 NVIDIA H800 (80GB) GPUs.
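For a sense of scale, the figures above imply the following back-of-the-envelope quantities. These are derived only from the stated numbers; the per-GPU batch is the share of each global batch before any gradient accumulation, which is not specified here, and the timing is approximate because the 14 days is itself approximate.

```python
# Values stated in the pre-training details.
STEPS = 200_000
GLOBAL_BATCH = 2048
NUM_GPUS = 16
DAYS = 14

samples_seen = STEPS * GLOBAL_BATCH            # total samples processed
per_gpu_batch = GLOBAL_BATCH // NUM_GPUS       # samples per GPU per step
gpu_hours = DAYS * 24 * NUM_GPUS               # total GPU-hours
seconds_per_step = DAYS * 24 * 3600 / STEPS    # average wall-clock time per step

print(f"samples seen:     {samples_seen:,}")
print(f"per-GPU batch:    {per_gpu_batch}")
print(f"GPU-hours:        {gpu_hours:,}")
print(f"seconds per step: {seconds_per_step:.2f}")
```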
OFT Training
1. Preparing OFT
We fine-tune BitVLA with the OFT training recipe from OpenVLA-OFT. First, set up the environment as required by that project; see its SETUP.md and LIBERO.md for detailed instructions.
conda create -n bitvla python=3.10 -y
conda activate bitvla
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124
# or use the provided docker
# docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity
cd BitVLA
pip install -e openvla-oft/
pip install -e transformers
cd openvla-oft/
# install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO/
pip install -r experiments/robot/libero/libero_requirements.txt
# install bitvla
pip install -e bitvla/
We use the same dataset as OpenVLA-OFT for fine-tuning on LIBERO. You can download it from HuggingFace:
git clone git@hf.co:datasets/openvla/modified_libero_rlds
2. OFT fine-tuning
Prepare BitVLA
- New pre-trained model (Recommended): This model is ready to use out of the box. No additional processing is required, and you can run the provided scripts directly.
- 🕰️ Old model: This version was not pre-trained on the Open X-Embodiment dataset. Before using it, you must convert the checkpoint into a format compatible with our codebase:
python convert_ckpt.py /path/to/bitvla-bitsiglipL-224px-bf16
Fine-tuning BitVLA
After that, you can fine-tune BitVLA using the provided shell scripts, one per LIBERO task suite:
sh ft_script/ft_bitvla_libero_spatial.sh
sh ft_script/ft_bitvla_libero_object.sh
sh ft_script/ft_bitvla_libero_goal.sh
sh ft_script/ft_bitvla_libero_long.sh
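To run the four fine-tuning jobs back to back, a small driver can be sketched as follows. It assumes it is launched from the directory that contains `ft_script/` and that each script runs to completion on its own; the helper names are illustrative, and the default is a dry run that only prints the commands.

```python
import subprocess

# Suite suffixes matching the script names listed above.
SUITES = ["spatial", "object", "goal", "long"]


def finetune_commands(suites=SUITES):
    """Build one `sh` invocation per LIBERO suite."""
    return [["sh", f"ft_script/ft_bitvla_libero_{suite}.sh"] for suite in suites]


def run_all(dry_run=True):
    """Print each command; execute it sequentially unless dry_run is True."""
    for cmd in finetune_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence if a job exits with an error.
            subprocess.run(cmd, check=True)


run_all(dry_run=True)
```

Pass `dry_run=False` to actually launch the jobs.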
Evaluation on LIBERO
You can download our fine-tuned BitVLA models from HuggingFace. For example, to evaluate on the LIBERO-Spatial suite, run:
python experiments/robot/libero/run_libero_eval_bitnet.py \
--pretrained_checkpoint /path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 \
--task_suite_name libero_spatial \
--info_in_path "information you want to show in path" \
--model_family "bitnet"
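When scripting evaluations, it can help to assemble this command programmatically. The sketch below only reuses the flags shown above. The helper name and the `spatial-run-1` label are made up for illustration, and the `--task_suite_name` values for the other suites should be checked against the evaluation script rather than assumed.

```python
import shlex


def build_eval_command(checkpoint, task_suite, note, model_family="bitnet"):
    """Assemble the argument list for run_libero_eval_bitnet.py."""
    return [
        "python", "experiments/robot/libero/run_libero_eval_bitnet.py",
        "--pretrained_checkpoint", checkpoint,
        "--task_suite_name", task_suite,
        "--info_in_path", note,
        "--model_family", model_family,
    ]


cmd = build_eval_command(
    "/path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16",
    "libero_spatial",
    "spatial-run-1",  # hypothetical label that ends up in the output path
)
print(shlex.join(cmd))  # shell-quoted, ready to paste or pass to subprocess.run
```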
Acknowledgement
This repository is built on LMM-Eval, HuggingFace's transformers, OpenVLA-OFT, and OpenVLA.
Citation
If you find this repository useful, please consider citing our work:
@article{bitvla,
title={BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation},
author={Hongyu Wang and Chuyan Xiong and Ruiping Wang and Xilin Chen},
year={2025},
eprint={2506.07530},
archivePrefix={arXiv},
primaryClass={cs.RO},
}
License
This project is licensed under the MIT License.
Contact Information
For help or issues using models, please submit a GitHub issue.
