
🌐 Website   |   🤗 Hugging Face Models   |   🔧 Deployment   |   📑 Paper
🖥️ UI-TARS-desktop   |   🏄 Midscene (Browser Automation)   |   🫨 Discord
We also offer UI-TARS-desktop, a version that runs on your local personal device. To use it, please visit https://github.com/bytedance/UI-TARS-desktop. To use UI-TARS for web automation, refer to the open-source project Midscene.js.
UI-TARS-1.5 is an open-source multimodal agent built upon a powerful vision-language model, capable of effectively performing diverse tasks within virtual worlds.
Leveraging the foundational architecture introduced in our recent paper, UI-TARS-1.5 integrates advanced reasoning enabled by reinforcement learning. This allows the model to reason through its thoughts before taking action, significantly enhancing its performance and adaptability, particularly in inference-time scaling. Our new 1.5 version achieves state-of-the-art results across a variety of standard benchmarks, demonstrating strong reasoning capabilities and notable improvements over prior models.
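To make the think-then-act loop concrete, below is a minimal sketch of querying a locally served UI-TARS-1.5 checkpoint through an OpenAI-compatible endpoint (e.g., one exposed by vLLM). The endpoint URL, served model name, and prompt wording are illustrative assumptions, not the official interface; see the deployment guide linked above for the supported setup.

```python
# Minimal sketch (assumptions: a UI-TARS-1.5 checkpoint served behind an
# OpenAI-compatible endpoint at localhost:8000 under the name "ui-tars-1.5").
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

# Encode the current screenshot so it can be sent inline as a data URL.
with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="ui-tars-1.5",  # assumed served model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Open the Settings app and enable dark mode."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"},
                },
            ],
        }
    ],
)

# The model reasons before acting: the reply interleaves a free-form
# "Thought: ..." segment with a structured "Action: ..." segment.
print(response.choices[0].message.content)
```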
**Online Benchmark Evaluation**
| Benchmark Type | Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
|---|---|---|---|---|---|
| **Computer Use** | OSWorld (100 steps) | **42.5** | 36.4 | 28 | 38.1 (200 steps) |
| **Computer Use** | Windows Agent Arena (50 steps) | **42.1** | - | - | 29.8 |
| **Browser Use** | WebVoyager | 84.8 | **87** | 84.1 | 87 |
| **Browser Use** | Online-Mind2web | **75.8** | 71 | 62.9 | 71 |
| **Phone Use** | Android World | **64.2** | - | - | 59.5 |
**Grounding Capability Evaluation**
| Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
|---|---|---|---|---|
| ScreenSpot-V2 | **94.2** | 87.9 | 87.6 | 91.6 |
| ScreenSpotPro | **61.6** | 23.4 | 27.7 | 43.6 |
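Grounding here means mapping a natural-language element description to pixel coordinates on a screenshot. The exact action grammar is defined in the UI-TARS repository; the sketch below assumes a simplified `click(start_box='(x,y)')`-style output purely for illustration, and shows the kind of parsing and coordinate rescaling a caller would do.

```python
# Hypothetical sketch: parse a grounding-style action string into coordinates.
# The real output grammar is defined in the UI-TARS repo; the format assumed
# here, click(start_box='(x,y)'), is a simplification for illustration.
import re


def parse_click(action: str) -> tuple[int, int] | None:
    """Extract (x, y) from an action like "click(start_box='(410,223)')"."""
    match = re.search(r"\((\d+)\s*,\s*(\d+)\)", action)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def to_screen(
    xy: tuple[int, int],
    model_wh: tuple[int, int],
    screen_wh: tuple[int, int],
) -> tuple[int, int]:
    """Rescale model-space coordinates to the actual screen resolution,
    a common post-processing step when the screenshot was resized."""
    x, y = xy
    return (x * screen_wh[0] // model_wh[0], y * screen_wh[1] // model_wh[1])


action = "click(start_box='(410,223)')"
point = parse_click(action)
if point is not None:
    print(to_screen(point, model_wh=(1000, 1000), screen_wh=(1920, 1080)))
```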
**Poki Game**
| Model | 2048 | cubinko | energy | free-the-key | Gem-11 | hex-frvr | Infinity-Loop | Maze:Path-of-Light | shapes | snake-solver | wood-blocks-3d | yarn-untangle | laser-maze-puzzle | tiles-master |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI CUA | 31.04 | 0.00 | 32.80 | 0.00 | 46.27 | 92.25 | 23.08 | 35.00 | 52.18 | 42.86 | 2.02 | 44.56 | 80.00 | 78.27 |
| Claude 3.7 | 43.05 | 0.00 | 41.60 | 0.00 | 0.00 | 30.76 | 2.31 | 82.00 | 6.26 | 42.86 | 0.00 | 13.77 | 28.00 | 52.18 |
| UI-TARS-1.5 | 100.00 | 0.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
**Minecraft**
| Task Type | Task Name | VPT | DreamerV3 | Previous SOTA | UI-TARS-1.5 w/o Thought | UI-TARS-1.5 w/ Thought |
|---|---|---|---|---|---|---|
| Mine Blocks | (oak_log) | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 |
| Mine Blocks | (obsidian) | 0.0 | 0.0 | 0.0 | 0.2 | 0.3 |
| Mine Blocks | (white_bed) | 0.0 | 0.0 | 0.1 | 0.4 | 0.6 |
| Mine Blocks | **200 Tasks Avg.** | 0.06 | 0.03 | 0.32 | 0.35 | 0.42 |
| Kill Mobs | (mooshroom) | 0.0 | 0.0 | 0.1 | 0.3 | 0.4 |
| Kill Mobs | (zombie) | 0.4 | 0.1 | 0.6 | 0.7 | 0.9 |
| Kill Mobs | (chicken) | 0.1 | 0.0 | 0.4 | 0.5 | 0.6 |
| Kill Mobs | **100 Tasks Avg.** | 0.04 | 0.03 | 0.18 | 0.25 | 0.31 |
Here we compare performance across different model scales of UI-TARS on the OSWorld and ScreenSpotPro benchmarks.
| **Benchmark Type** | **Benchmark** | **UI-TARS-72B-DPO** | **UI-TARS-1.5-7B** | **UI-TARS-1.5** |
|---|---|---|---|---|
| Computer Use | OSWorld | 24.6 | 27.5 | **42.5** |
| GUI Grounding | ScreenSpotPro | 38.1 | 49.6 | **61.6** |
While UI-TARS-1.5 represents a significant advancement in multimodal agent capabilities, we acknowledge that several important limitations remain.
We are providing early research access to our top-performing UI-TARS-1.5 model to facilitate collaborative research. Interested researchers can contact us at TARS@bytedance.com.
Looking ahead, we envision UI-TARS evolving into increasingly sophisticated agentic experiences capable of performing real-world actions, thereby empowering platforms such as doubao to accomplish more complex tasks for you :)
If you find our paper and model useful in your research, please consider citing our work:
@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}