Vision utilities for web interaction agents 👀
MIT License
Full Changelog: https://github.com/reworkd/tarsier/compare/v0.5.0...v0.6.0
Published by awtkns 11 months ago
Full Changelog: https://github.com/reworkd/tarsier/compare/v0.4.0...v0.5.0
Full Changelog: https://github.com/reworkd/tarsier/compare/v0.3.1...v0.4.0
If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:
At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects.
Because of this, we're now open-sourcing this simple utility library for multimodal web agents... Tarsier!
The video below demonstrates Tarsier usage by feeding a page snapshot into a LangChain agent and letting it take actions.
https://github.com/reworkd/tarsier/assets/50181239/af12beda-89b5-4add-b888-d780b353304b
Tarsier works by visually "tagging" interactable elements on a page with a bracketed id, such as [1].
In doing this, we provide a mapping between elements and ids for GPT-4(V) to take actions upon.
We define interactable elements as buttons, links, or input fields that are visible on the page.
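
To make the tag-to-element mapping concrete, here is a minimal sketch of how a model reply that mentions a tag could be resolved back to an element locator. The mapping contents, the `resolve_tag` helper, and the reply text are all hypothetical and only illustrate the idea; they are not part of Tarsier's API.

```python
import re
from typing import Mapping

# Hypothetical mapping from tag id to XPath (illustrative values only).
TAG_TO_XPATH = {
    1: "//a[@id='login']",
    2: "//input[@name='q']",
    3: "//button[@type='submit']",
}

# Matches a bracketed integer id such as [3].
TAG_PATTERN = re.compile(r"\[(\d+)\]")


def resolve_tag(llm_reply: str, tag_to_xpath: Mapping[int, str]) -> str:
    """Return the XPath for the first bracketed tag id found in the reply."""
    match = TAG_PATTERN.search(llm_reply)
    if match is None:
        raise ValueError("no bracketed tag id in reply")
    tag_id = int(match.group(1))
    if tag_id not in tag_to_xpath:
        raise KeyError(f"unknown tag id: {tag_id}")
    return tag_to_xpath[tag_id]


if __name__ == "__main__":
    print(resolve_tag("Click [3] to submit the form.", TAG_TO_XPATH))
```

The resolved XPath could then be handed to a browser automation library to click or type into the element.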
Tarsier can also provide a textual representation of the page, which enables deeper interaction even for LLMs that are not multimodal.
This matters given the performance issues of existing vision language models.
Tarsier also provides OCR utils to convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
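
As a rough illustration of what a "whitespace-structured string" can mean, the sketch below places OCR words on a character grid according to their pixel positions. This is an assumed, simplified approach for illustration, not Tarsier's actual implementation; the `layout_text` function, the `(text, x, y)` word format, and the cell sizes are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical OCR word: (text, x, y), with x and y in pixels from the top-left corner.
Word = Tuple[str, float, float]


def layout_text(words: List[Word], char_width: float = 10.0,
                line_height: float = 20.0) -> str:
    """Place words on a character grid so spacing roughly mirrors the page layout."""
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        row = int(y // line_height)   # which text line the word falls on
        col = int(x // char_width)    # which character column it starts at
        rows.setdefault(row, []).append((col, text))
    if not rows:
        return ""

    lines = []
    for row in range(min(rows), max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and col <= len(line):
                # Words would touch or overlap: separate them with one space.
                line += " "
            else:
                # Pad with spaces up to the word's starting column.
                line = line.ljust(col)
            line += text
        lines.append(line)  # rows with no words become blank lines
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [("Home", 0, 0), ("Login", 200, 0), ("Search", 0, 40)]
    print(layout_text(sample))
```

Horizontal gaps become runs of spaces and vertical gaps become blank lines, so a text-only model still gets some sense of where things sit on the page.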
Visit our cookbook for agent examples using Tarsier:
Otherwise, basic Tarsier usage might look like the following:

    import asyncio

    from playwright.async_api import async_playwright
    from tarsier import Tarsier, GoogleVisionOCRService


    async def main():
        # Placeholder: supply your own Google Cloud credentials here
        google_cloud_credentials = {}

        ocr_service = GoogleVisionOCRService(google_cloud_credentials)
        tarsier = Tarsier(ocr_service)

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            page = await browser.new_page()
            await page.goto("https://news.ycombinator.com")

            page_text, tag_to_xpath = await tarsier.page_to_text(page)

            print(tag_to_xpath)  # Mapping of tags to XPaths
            print(page_text)  # Text representation of the page


    if __name__ == '__main__':
        asyncio.run(main())

Special shoutout to @KhoomeiK for making this happen! ❤️