
UI-TARS


🌐 Website   | 🤗 Hugging Face Models   |    🔧 Deployment    |    📑 Paper    |   🖥️ UI-TARS-desktop  
🏄 Midscene (Browser Automation)    |   🫨 Discord  

We also offer UI-TARS-desktop, a version that runs on your local personal device. To use it, please visit https://github.com/bytedance/UI-TARS-desktop. To use UI-TARS for web automation, you may refer to the open-source project Midscene.js.

Updates

  • 🌟 2025.04.16: We shared the latest progress of the UI-TARS-1.5 model in our blog, which excels in playing games and performing GUI tasks, and we open-sourced the UI-TARS-1.5-7B.
  • ✨ 2025.03.23: We updated our OSWorld inference scripts to align with the official OSWorld repository. You can now use the official OSWorld inference scripts to reproduce our results.

Introduction

UI-TARS-1.5 is an open-source multimodal agent built upon a powerful vision-language model, capable of effectively performing diverse tasks within virtual worlds.

Leveraging the foundational architecture introduced in our recent paper, UI-TARS-1.5 integrates advanced reasoning enabled by reinforcement learning. This allows the model to reason through its thoughts before taking action, significantly enhancing its performance and adaptability, particularly in inference-time scaling. Our new 1.5 version achieves state-of-the-art results across a variety of standard benchmarks, demonstrating strong reasoning capabilities and notable improvements over prior models.
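The reason-then-act loop described above means each model reply carries an explicit reasoning trace followed by a single action for the environment. As a minimal sketch of how a client might split such a reply, assuming an illustrative `Thought:` / `Action:` layout and a hypothetical `click(start_box=...)` action (the authoritative schema lives in the repository's system prompts):

```python
import re

# Illustrative sketch: a UI-TARS-style reply contains a reasoning trace
# ("Thought:") followed by exactly one action ("Action:"). The field names
# and the click(start_box=...) action below are assumptions for this demo,
# not the repository's authoritative format.
RESPONSE_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.*)", re.DOTALL
)

def parse_response(text: str) -> dict:
    """Split a model reply into its reasoning trace and the action to execute."""
    m = RESPONSE_RE.search(text)
    if m is None:
        raise ValueError("reply does not follow the Thought/Action format")
    return {"thought": m.group("thought"), "action": m.group("action").strip()}

reply = (
    "Thought: The login button is in the top-right corner; clicking it opens the form.\n"
    "Action: click(start_box='(912,44)')"
)
parsed = parse_response(reply)
```

Separating the trace from the action lets the executor act on the action string alone while the thought is kept for logging and inference-time scaling.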

Deployment
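UI-TARS checkpoints are commonly served behind an OpenAI-compatible endpoint (for example via vLLM). The sketch below only assembles one chat-completion request pairing an instruction with a screenshot; the model id, instruction, and image bytes are placeholders, so consult the official deployment guide for authoritative settings:

```python
import base64

# Hedged sketch: build an OpenAI-compatible chat payload for a self-hosted
# UI-TARS endpoint. The model id and instruction here are placeholders;
# no request is actually sent.
def build_request(model: str, instruction: str, screenshot_png: bytes) -> dict:
    """Assemble one chat-completion request with a text + image user turn."""
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    }

req = build_request("ui-tars-1.5-7b", "Open the settings menu.", b"\x89PNG...")
```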

System Prompts
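The repository ships task-specific system prompts that pin down the agent's output format and allowed action space. As a simplified illustration of how such a prompt might be templated per task (the wording and action names below are assumptions, not the official prompt text):

```python
# Hedged sketch: a simplified stand-in for a UI-TARS system prompt. The
# template text and example actions are illustrative only.
SYSTEM_PROMPT_TEMPLATE = """You are a GUI agent. You are given a task and a
screenshot of the screen. Reply with your reasoning and exactly one action.

## Output Format
Thought: ...
Action: ...

## Action Space
{action_space}
"""

def render_system_prompt(actions: list) -> str:
    """Fill the template with the action space allowed for this task."""
    return SYSTEM_PROMPT_TEMPLATE.format(action_space="\n".join(actions))

prompt = render_system_prompt(
    ["click(start_box='(x,y)')", "type(content='...')", "finished()"]
)
```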

Performance

Online Benchmark Evaluation

| Benchmark Type | Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
|---|---|---|---|---|---|
| **Computer Use** | OSWorld (100 steps) | **42.5** | 36.4 | 28 | 38.1 (200 steps) |
| | Windows Agent Arena (50 steps) | **42.1** | - | - | 29.8 |
| **Browser Use** | WebVoyager | 84.8 | **87** | 84.1 | 87 |
| | Online-Mind2Web | **75.8** | 71 | 62.9 | 71 |
| **Phone Use** | Android World | **64.2** | - | - | 59.5 |

Grounding Capability Evaluation

| Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
|---|---|---|---|---|
| ScreenSpot-V2 | **94.2** | 87.9 | 87.6 | 91.6 |
| ScreenSpotPro | **61.6** | 23.4 | 27.7 | 43.6 |
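Grounding benchmarks such as ScreenSpot score whether the predicted click point lands inside the target element's bounding box, and grounding models often emit coordinates in a normalized space that must be rescaled to the real screenshot before a click can be dispatched. A minimal sketch of that pipeline, assuming a 0-1000 normalized space (a common convention for vision-language grounding models, not confirmed for UI-TARS-1.5 specifically):

```python
# Hedged sketch: rescale a normalized (x, y) prediction to pixels, then
# score it ScreenSpot-style. The 0-1000 normalization is an assumption.
def to_pixels(norm_x, norm_y, width, height, scale=1000):
    """Map a normalized (x, y) prediction onto a width x height screenshot."""
    return round(norm_x * width / scale), round(norm_y * height / scale)

def hits_target(px, py, box):
    """ScreenSpot-style scoring: the click counts if it lands inside the box."""
    left, top, right, bottom = box
    return left <= px <= right and top <= py <= bottom

x, y = to_pixels(500, 250, 1920, 1080)  # centre-left of a 1080p screen
```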

Poki Game

| Model | 2048 | cubinko | energy | free-the-key | Gem-11 | hex-frvr | Infinity-Loop | Maze:Path-of-Light | shapes | snake-solver | wood-blocks-3d | yarn-untangle | laser-maze-puzzle | tiles-master |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI CUA | 31.04 | 0.00 | 32.80 | 0.00 | 46.27 | 92.25 | 23.08 | 35.00 | 52.18 | 42.86 | 2.02 | 44.56 | 80.00 | 78.27 |
| Claude 3.7 | 43.05 | 0.00 | 41.60 | 0.00 | 0.00 | 30.76 | 2.31 | 82.00 | 6.26 | 42.86 | 0.00 | 13.77 | 28.00 | 52.18 |
| UI-TARS-1.5 | 100.00 | 0.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |

Minecraft

| Task Type | Task Name | VPT | DreamerV3 | Previous SOTA | UI-TARS-1.5 w/o Thought | UI-TARS-1.5 w/ Thought |
|---|---|---|---|---|---|---|
| Mine Blocks | (oak_log) | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 |
| | (obsidian) | 0.0 | 0.0 | 0.0 | 0.2 | 0.3 |
| | (white_bed) | 0.0 | 0.0 | 0.1 | 0.4 | 0.6 |
| | **200 Tasks Avg.** | 0.06 | 0.03 | 0.32 | 0.35 | 0.42 |
| Kill Mobs | (mooshroom) | 0.0 | 0.0 | 0.1 | 0.3 | 0.4 |
| | (zombie) | 0.4 | 0.1 | 0.6 | 0.7 | 0.9 |
| | (chicken) | 0.1 | 0.0 | 0.4 | 0.5 | 0.6 |
| | **100 Tasks Avg.** | 0.04 | 0.03 | 0.18 | 0.25 | 0.31 |

Model Scale Comparison

Here we compare performance across different model scales of UI-TARS on the OSWorld benchmark.

| Benchmark Type | Benchmark | UI-TARS-72B-DPO | UI-TARS-1.5-7B | UI-TARS-1.5 |
|---|---|---|---|---|
| Computer Use | OSWorld | 24.6 | 27.5 | **42.5** |
| GUI Grounding | ScreenSpotPro | 38.1 | 49.6 | **61.6** |

Limitations

While UI-TARS-1.5 represents a significant advancement in multimodal agent capabilities, we acknowledge several important limitations:

  • **Misuse**: Given its enhanced performance in GUI tasks, including successfully navigating authentication challenges such as CAPTCHAs, UI-TARS-1.5 could potentially be misused for unauthorized access or for automating protected content. To mitigate this risk, extensive internal safety evaluations are underway.
  • **Computation**: UI-TARS-1.5 still requires substantial computational resources, particularly for large-scale tasks or extended gameplay scenarios.
  • **Hallucination**: UI-TARS-1.5 may occasionally generate inaccurate descriptions, misidentify GUI elements, or take suboptimal actions based on incorrect inferences, especially in ambiguous or unfamiliar environments.
  • **Model scale**: The released UI-TARS-1.5-7B focuses primarily on enhancing general computer-use capabilities and is not specifically optimized for game-based scenarios, where UI-TARS-1.5 still holds a significant advantage.

What’s next

We are providing early research access to our top-performing UI-TARS-1.5 model to facilitate collaborative research. Interested researchers can contact us at TARS@bytedance.com.

Looking ahead, we envision UI-TARS evolving into increasingly sophisticated agentic experiences capable of performing real-world actions, thereby empowering platforms such as doubao to accomplish more complex tasks for you :)

Star History

Star History Chart

Citation

If you find our paper and model useful in your research, please consider citing us:

@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}