---
license: mit
language:
- en
base_model:
- microsoft/Florence-2-large
pipeline_tag: robotics
tags:
- VLA
- LIBERO
- Robotics
- Flow
---

# FlowerVLA - Vision-Language-Action Flow Model Fine-Tuned on LIBERO 10

This is a pretrained FlowerVLA model for robotic manipulation, fine-tuned on the LIBERO 10 dataset. FLOWER is an efficient Vision-Language-Action flow policy for robot learning that contains only ~1B parameters.

## Model Description

FlowerVLA is a novel architecture that:
- Uses half of Florence-2 for multi-modal vision-language encoding
- Employs a novel transformer-based flow matching architecture
- Provides an efficient, versatile VLA policy with only ~1B parameters

## Model Performance

This checkpoint contains weights for the LIBERO 10 challenge and achieves the following success rates:

| Task | Success Rate |
|------|--------------|
| Average (all tasks) | 0.944 |
| LIVING_ROOM_SCENE2_put_both_the_alphabet_soup_and_the_tomato_sauce_in_the_basket | 0.979 |
| LIVING_ROOM_SCENE2_put_both_the_cream_cheese_box_and_the_butter_in_the_basket | 1.000 |
| KITCHEN_SCENE3_turn_on_the_stove_and_put_the_moka_pot_on_it | 0.979 |
| KITCHEN_SCENE4_put_the_black_bowl_in_the_bottom_drawer_of_the_cabinet_and_close_it | 1.000 |
| LIVING_ROOM_SCENE5_put_the_white_mug_on_the_left_plate_and_put_the_yellow_and_white_mug_on_the_right_plate | 0.941 |
| STUDY_SCENE1_pick_up_the_book_and_place_it_in_the_back_compartment_of_the_caddy | 1.000 |
| LIVING_ROOM_SCENE6_put_the_white_mug_on_the_plate_and_put_the_chocolate_pudding_to_the_right_of_the_plate | 0.899 |
| LIVING_ROOM_SCENE1_put_both_the_alphabet_soup_and_the_cream_cheese_box_in_the_basket | 1.000 |
| KITCHEN_SCENE8_put_both_moka_pots_on_the_stove | 0.740 |
| KITCHEN_SCENE6_put_the_yellow_and_white_mug_in_the_microwave_and_close_it | 0.902 |

### Input/Output Specifications

#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings

#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta EEF actions

## Usage

Check out the full model implementation on GitHub [todo]() and follow the instructions in the README to test the model in one of the environments.

```python
obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image
    }
}
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)
```

## Training Details

### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.05

## Citation

```bibtex
@inproceedings{
reuss2025flower,
title={{FLOWER}: Democratizing Generalist Robot Policies with Efficient Vision-Language-Flow Models},
author={Moritz Reuss and Hongyi Zhou and Marcel R{\"u}hle and {\"O}mer Erdin{\c{c}} Ya{\u{g}}murlu and Fabian Otto and Rudolf Lioutikov},
booktitle={9th Annual Conference on Robot Learning},
year={2025},
url={https://openreview.net/forum?id=JeppaebLRD}
}
```

## License

This model is released under the MIT license.
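
## Appendix: Observation Formatting Sketch

As a quick reference, this minimal sketch constructs dummy observation tensors with the shapes documented in the Input/Output section and packs them into the dictionary format used in the Usage example. The image resolution (`H = W = 224`), the observation horizon, and the use of PyTorch tensors are illustrative assumptions; follow the repository's own preprocessing for real rollouts.

```python
import torch

# Illustrative values only: batch size, observation horizon, and image
# resolution are assumptions, not requirements stated on this card.
B, T, H, W = 1, 1, 224, 224

# Dummy camera observations with the documented (B, T, 3, H, W) layout.
static_image = torch.zeros(B, T, 3, H, W)
gripper_image = torch.zeros(B, T, 3, H, W)

obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image,
    }
}
goal = {"lang_text": "pick up the blue cube"}

# With a loaded FlowerVLA model, the call from the Usage section would return
# a (B, T, 7) tensor of delta EEF actions:
# action = model.step(obs, goal)
```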