Skip to content

microsoft/OmniParser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

Logo

arXiv License

๐Ÿ“ข [Project Page] [V2 Blog Post] [Models V2] [Models V1.5] [HuggingFace Space Demo]

OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.

News

  • [2025/3] We support local logging of trajecotry so that you can use OmniParser+OmniTool to build training data pipeline for your favorate agent in your domain. [Documentation WIP]
  • [2025/3] We are gradually adding multi agents orchstration and improving user interface in OmniTool for better experience.
  • [2025/2] We release OmniParser V2 checkpoints. Watch Video
  • [2025/2] We introduce OmniTool: Control a Windows 11 VM with OmniParser + your vision model of choice. OmniTool supports out of the box the following large language models - OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL) or Anthropic Computer Use. Watch Video
  • [2025/1] V2 is coming. We achieve new state of the art results 39.5% on the new grounding benchmark Screen Spot Pro with OmniParser v2 (will be released soon)! Read more details here.
  • [2024/11] We release an updated version, OmniParser V1.5 which features 1) more fine grained/small icon detection, 2) prediction of whether each screen element is interactable or not. Examples in the demo.ipynb.
  • [2024/10] OmniParser was the #1 trending model on huggingface model hub (starting 10/29/2024).
  • [2024/10] Feel free to checkout our demo on huggingface space! (stay tuned for OmniParser + Claude Computer Use)
  • [2024/10] Both Interactive Region Detection Model and Icon functional description model are released! Hugginface models
  • [2024/09] OmniParser achieves the best performance on Windows Agent Arena!

Install

First clone the repo, and then install environment:

cd OmniParser
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Ensure you have the V2 weights downloaded in weights folder (ensure caption weights folder is called icon_caption_florence). If not download them with:

   # download the model checkpoints to local directory OmniParser/weights/
   for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
   mv weights/icon_caption weights/icon_caption_florence

Examples:

We put together a few simple examples in the demo.ipynb.

Gradio Demo

To run gradio demo, simply run:

python gradio_demo.py

Model Weights License

For the model checkpoints on huggingface model hub, please note that icon_detect model is under AGPL license since it is a license inherited from the original yolo model. And icon_caption_blip2 & icon_caption_florence is under MIT license. Please refer to the LICENSE file in the folder of each model: https://huggingface.co/microsoft/OmniParser.

๐Ÿ“š Citation

Our technical report can be found here. If you find our work useful, please consider citing our work:

@misc{lu2024omniparserpurevisionbased,
      title={OmniParser for Pure Vision Based GUI Agent}, 
      author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},
      year={2024},
      eprint={2408.00203},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.00203}, 
}

About

A simple screen parsing tool towards pure vision based GUI agent

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Packages

No packages published
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy