Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

Tianjin University

Abstract

Generalization in embodied AI is hindered by the "seeing-to-doing gap", which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) designed specifically for embodied reasoning and pointing. We train Embodied-R1 with a two-stage Reinforced Fine-tuning (RFT) curriculum using a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization, achieving a 56.2% success rate on SimplerEnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, a 62% improvement over strong baselines. The model is also highly robust to diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable path to closing the perception-action gap in robotics.
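The abstract refers to a specialized multi-task reward design for RFT without spelling it out here. As a rough illustration only, a pointing-accuracy reward can be as simple as checking whether a predicted point lands inside the ground-truth object mask, with a small bonus for well-formed output. The tag format, partial-credit values, and function name below are assumptions for this sketch, not the paper's exact specification.

```python
import re
import numpy as np

def pointing_reward(response: str, gt_mask: np.ndarray) -> float:
    """Illustrative RFT reward: 1.0 if the predicted point falls inside the
    ground-truth object mask; a small bonus for well-formed but wrong output.
    The <point>(x, y)</point> format and weights are assumptions, not the
    paper's published reward specification."""
    match = re.search(r"<point>\s*\((\d+),\s*(\d+)\)\s*</point>", response)
    if match is None:
        return 0.0  # malformed output earns no reward
    x, y = int(match.group(1)), int(match.group(2))
    h, w = gt_mask.shape
    if not (0 <= x < w and 0 <= y < h):
        return 0.1  # correctly formatted but out of the image bounds
    return 1.0 if gt_mask[y, x] else 0.1  # hit vs. miss on the target mask
```

In practice such a verifiable, mask-based reward is what makes pointing amenable to reinforced fine-tuning: correctness can be checked automatically without human labels per rollout.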

Embodied-R1 Framework
Overview of the Embodied-R1 framework. Embodied-R1 is a 3B vision-language model (VLM) for general robotic manipulation. By combining a pointing-based intermediate representation with Reinforced Fine-tuning (RFT), it bridges the "seeing-to-doing" gap in robotics and achieves strong zero-shot generalization.
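To make the pointing-to-action bridge concrete, below is a minimal sketch of how a 2D point predicted by the model could be lifted to a 3D target using a depth image and camera intrinsics before being handed to a low-level pick-and-place primitive. The function name, intrinsics layout, and example values are illustrative assumptions; the actual robot stack used in the paper is not reproduced here.

```python
import numpy as np

def point_to_grasp(point_xy, depth_img, intrinsics):
    """Back-project a predicted 2D image point to a 3D target in the camera frame.
    `intrinsics` holds fx, fy, cx, cy; the downstream pick primitive is left
    abstract because it depends on the specific robot and motion planner."""
    u, v = point_xy
    z = depth_img[v, u]                          # depth (meters) at the pointed pixel
    x = (u - intrinsics["cx"]) * z / intrinsics["fx"]
    y = (v - intrinsics["cy"]) * z / intrinsics["fy"]
    return np.array([x, y, z])                   # metric grasp target, camera frame

# Example (hypothetical values): a pointed pixel plus depth becomes a metric
# target that a motion planner or grasp primitive can consume.
# target = point_to_grasp((412, 287), depth,
#                         {"fx": 615.0, "fy": 615.0, "cx": 320.0, "cy": 240.0})
```

Because the model only needs to emit image-space points, the same output can drive different robots; only this back-projection and the low-level primitive are embodiment-specific.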


Results

Real-world Robot Manipulation Demonstrations

Embodied-R1 performs strongly on real-world robot manipulation, achieving an 87.5% success rate across 8 diverse tasks in a zero-shot setting (no task-specific fine-tuning). Below are visualizations of the model performing these tasks.

Real-world experimental evaluation results
Real-world experimental evaluation results.

Task 1: Pick up the strawberry

Task 2: Move the egg to the bowl

Task 3: Move the vise to the red basket

Task 4: Place the fork in the green bin

Task 5: Pick the [x] toothbrush and place it in the bucket

Task 6: Move the nearest object to the right side of the drawer

Task 7: Put the screwdriver between the drawer and the vase

Task 8: Move the moka pot to the right of the drawer



Visualization of Performance on Various Pointing Tasks

Embodied-R1 Visualization
Visualizing Embodied-R1's Performance on Various Pointing Tasks. The model can follow diverse text instructions and generalize its capabilities to novel, unseen environments.

Additional Embodied-R1 Results


Robustness to Visual Disturbances

Visual Disturbance Results
Embodied-R1's performance under different visual disturbance conditions. The model maintains a 100% success rate under the original conditions and under background changes, and an 83% success rate under combined background, lighting, and camera-height changes.

Original

Background Change

Background+Light Change

Background+Light+Height Change

The process of Embodied-R1 performing Task 6 under different visual disturbances.

BibTeX

@misc{yuan2025embodiedr1reinforcedembodiedreasoning,
      title={Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation},
      author={Yifu Yuan and Haiqin Cui and Yaoting Huang and Yibin Chen and Fei Ni and Zibin Dong and Pengyi Li and Yan Zheng and Jianye Hao},
      year={2025},
      eprint={2508.13998},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.13998}
}