ByteDance-Seed/UI-TARS-1.5-7B

image text to texttransformersentransformerssafetensorsqwen2_5_vlimage-to-textmultimodalguiapache-2.0
321.0K

UI-TARS-1.5 Model

We shared the latest progress of the UI-TARS-1.5 model in our blog, which excels in playing games and performing GUI tasks.

Introduction

UI-TARS-1.5, an open-source multimodal agent built upon a powerful vision-language model. It is capable of effectively performing diverse tasks within virtual worlds.

Leveraging the foundational architecture introduced in our recent paper, UI-TARS-1.5 integrates advanced reasoning enabled by reinforcement learning. This allows the model to reason through its thoughts before taking action, significantly enhancing its performance and adaptability, particularly in inference-time scaling. Our new 1.5 version achieves state-of-the-art results across a variety of standard benchmarks, demonstrating strong reasoning capabilities and notable improvements over prior models.

Code: https://github.com/bytedance/UI-TARS

Application: https://github.com/bytedance/UI-TARS-desktop

Performance

Online Benchmark Evaluation

Benchmark typeBenchmarkUI-TARS-1.5OpenAI CUAClaude 3.7Previous SOTA
Computer UseOSworld (100 steps)42.536.42838.1 (200 step)
Windows Agent Arena (50 steps)42.1--29.8
Browser UseWebVoyager84.88784.187
Online-Mind2web75.87162.971
Phone UseAndroid World64.2--59.5

Grounding Capability Evaluation

BenchmarkUI-TARS-1.5OpenAI CUAClaude 3.7Previous SOTA
ScreensSpot-V294.287.987.691.6
ScreenSpotPro61.623.427.743.6

Poki Game

Model2048cubinkoenergyfree-the-keyGem-11hex-frvrInfinity-LoopMaze:Path-of-Lightshapessnake-solverwood-blocks-3dyarn-untanglelaser-maze-puzzletiles-master
OpenAI CUA31.040.0032.800.0046.2792.2523.0835.0052.1842.862.0244.5680.0078.27
Claude 3.743.050.0041.600.000.0030.762.3182.006.2642.860.0013.7728.0052.18
UI-TARS-1.5100.000.00100.00100.00100.00100.00100.00100.00100.00100.00100.00100.00100.00100.00

Minecraft

Task TypeTask NameVPTDreamerV3Previous SOTAUI-TARS-1.5 w/o ThoughtUI-TARS-1.5 w/ Thought
Mine Blocks(oak_log)0.81.01.01.01.0
(obsidian)0.00.00.00.20.3
(white_bed)0.00.00.10.40.6
200 Tasks Avg.0.060.030.320.350.42
Kill Mobs(mooshroom)0.00.00.10.30.4
(zombie)0.40.10.60.70.9
(chicken)0.10.00.40.50.6
100 Tasks Avg.0.040.030.180.250.31

Model Scale Comparison

This table compares performance across different model scales of UI-TARS on the OSworld benchmark.

Benchmark TypeBenchmarkUI-TARS-72B-DPOUI-TARS-1.5-7BUI-TARS-1.5
Computer UseOSWorld24.627.542.5
GUI GroundingScreenSpotPro38.149.661.6

The released UI-TARS-1.5-7B focuses primarily on enhancing general computer use capabilities and is not specifically optimized for game-based scenarios, where the UI-TARS-1.5 still holds a significant advantage.

What's next

We are providing early research access to our top-performing UI-TARS-1.5 model to facilitate collaborative research. Interested researchers can contact us at TARS@bytedance.com.

Citation

If you find our paper and model useful in your research, feel free to give us a cite.

@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}
DEPLOY IN 60 SECONDS

Run UI-TARS-1.5-7B on Runcrate

Deploy on H100, A100, or RTX GPUs. Pay only for what you use. No setup required.