---
license: mit
language:
- en
base_model:
- microsoft/Florence-2-large
pipeline_tag: robotics
tags:
- VLA
- LIBERO
- Robotics
- Flow
---

# FlowerVLA - Vision-Language-Action Flow Model Fine-Tuned on LIBERO 10

This is a pretrained FlowerVLA model for robotic manipulation, fine-tuned on the LIBERO 10 dataset. FLOWER is an efficient Vision-Language-Action flow policy for robot learning that contains only ~1B parameters.

## Model Description

FlowerVLA is a novel architecture that:
- Uses half of Florence-2 for multi-modal vision-language encoding
- Employs a novel transformer-based flow matching architecture
- Provides an efficient, versatile VLA policy with only ~1B parameters

## Model Performance

This checkpoint contains weights for the LIBERO 10 challenge and achieves the following success rates:

| Task | Success Rate |
|------|--------------|
| Average (all tasks) | 0.944 |
| LIVING_ROOM_SCENE2_put_both_the_alphabet_soup_and_the_tomato_sauce_in_the_basket | 0.979 |
| LIVING_ROOM_SCENE2_put_both_the_cream_cheese_box_and_the_butter_in_the_basket | 1.000 |
| KITCHEN_SCENE3_turn_on_the_stove_and_put_the_moka_pot_on_it | 0.979 |
| KITCHEN_SCENE4_put_the_black_bowl_in_the_bottom_drawer_of_the_cabinet_and_close_it | 1.000 |
| LIVING_ROOM_SCENE5_put_the_white_mug_on_the_left_plate_and_put_the_yellow_and_white_mug_on_the_right_plate | 0.941 |
| STUDY_SCENE1_pick_up_the_book_and_place_it_in_the_back_compartment_of_the_caddy | 1.000 |
| LIVING_ROOM_SCENE6_put_the_white_mug_on_the_plate_and_put_the_chocolate_pudding_to_the_right_of_the_plate | 0.899 |
| LIVING_ROOM_SCENE1_put_both_the_alphabet_soup_and_the_cream_cheese_box_in_the_basket | 1.000 |
| KITCHEN_SCENE8_put_both_moka_pots_on_the_stove | 0.740 |
| KITCHEN_SCENE6_put_the_yellow_and_white_mug_in_the_microwave_and_close_it | 0.902 |

### Input/Output Specifications

#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings

#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta EEF actions

## Usage

Check out the full model implementation on GitHub [todo]() and follow the instructions in the README to test the model in one of the environments.

```python
obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image
    }
}
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)
```

## Training Details

### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.05

## Citation

```bibtex
@inproceedings{
reuss2025flower,
title={{FLOWER}: Democratizing Generalist Robot Policies with Efficient Vision-Language-Flow Models},
author={Moritz Reuss and Hongyi Zhou and Marcel R{\"u}hle and {\"O}mer Erdin{\c{c}} Ya{\u{g}}murlu and Fabian Otto and Rudolf Lioutikov},
booktitle={9th Annual Conference on Robot Learning},
year={2025},
url={https://openreview.net/forum?id=JeppaebLRD}
}
```

## License

This model is released under the MIT license.
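
## Appendix: Observation Formatting Sketch

As a quick reference, this minimal sketch constructs dummy observation tensors with the shapes documented in the Input/Output section and packs them into the dictionary format used in the Usage example. The image resolution (`H = W = 224`), the observation horizon, and the use of PyTorch tensors are illustrative assumptions; follow the repository's own preprocessing for real rollouts.

```python
import torch

# Illustrative values only: batch size, observation horizon, and image
# resolution are assumptions, not requirements stated on this card.
B, T, H, W = 1, 1, 224, 224

# Dummy camera observations with the documented (B, T, 3, H, W) layout.
static_image = torch.zeros(B, T, 3, H, W)
gripper_image = torch.zeros(B, T, 3, H, W)

obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image,
    }
}
goal = {"lang_text": "pick up the blue cube"}

# With a loaded FlowerVLA model, the call from the Usage section would return
# a (B, T, 7) tensor of delta EEF actions:
# action = model.step(obs, goal)
```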