A task generation and model evaluation system.
APACHE-2.0 License
🔥[2024-09-26]: Task Me Anything was accepted to the NeurIPS 2024 Datasets & Benchmarks track!
🔥[2024-08-03]: TaskMeAnything-v1-2024 released! A benchmark that reflects the current progress of MLMs by automatically finding tasks that popular MLMs struggle with, using the TaskMeAnything Top-K query and query approximation algorithms. It includes 12,270 ImageQA and 3,567 VideoQA questions that TaskMeAnything automatically approximated as challenging.
🔥[2024-07-04]: Demo for TaskMeAnything released! Check out our demo for generating customized ImageQA and VideoQA benchmarks and running model evaluation queries!
🔥[2024-06-17]: Paper arXived!
🔥[2024-06-01]: Code released!
TaskMeAnything is a benchmark generation engine which produces a benchmark for large multimodal language models (MLMs) tailored to a user's needs. In particular, TaskMeAnything maintains an extendable taxonomy of visual assets and can programmatically generate a vast number of task instances. Additionally, it algorithmically addresses user queries regarding MLM performance efficiently within a computational budget. The current version can generate > 750M image/video question-answering pairs, which focus on evaluating MLM perceptual capabilities.
❗ TaskMeAnything does NOT involve any AI model during image/video, question, and answer generation, so the generated tasks do NOT suffer from model imperfection or hallucinations.
We release the following resources:
TaskMeAnything-v1-2024: a benchmark that reflects the current progress of MLMs by automatically finding tasks that popular MLMs struggle with, using the TaskMeAnything Top-K query and query approximation algorithms. It includes 12,270 ImageQA and 3,567 VideoQA questions that TaskMeAnything automatically approximated as challenging for over 20 popular MLMs.
Notice: If you want to evaluate VideoQA models, please check our videoqa model branch.
You can easily download the repo and set up the environments via:
git clone https://github.com/JieyuZ2/TaskMeAnything.git
cd ./TaskMeAnything
pip install -r requirements.txt
Notice: if you want to render 3D images/videos with Blender locally, or use Internvl-chat-v1.5-24B, which requires flash-attn (hard to install with pip), you can use the Docker image we provide.
You can pull the Docker image from DockerHub. It includes all the dependencies, such as Blender, flash-attn, the CUDA driver, and nvcc.
docker pull weikaih/ubuntu20.4_internvl_blender_v1.2:latest
docker run --gpus all -it weikaih/ubuntu20.4_internvl_blender_v1.2:latest /bin/bash # run the docker image with GPU support
git clone https://github.com/JieyuZ2/TaskMeAnything.git
cd ./TaskMeAnything
pip install -r requirements.txt
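If you install locally rather than through Docker, it can help to check which of the optional dependencies mentioned above are already present. The following is a minimal sketch, not part of the TaskMeAnything codebase: the helper name `check_optional_dependencies` is hypothetical, and it assumes Blender and nvcc are exposed as `blender` and `nvcc` executables on your PATH and that flash-attn is importable as the `flash_attn` module.

```python
import importlib.util
import shutil


def check_optional_dependencies():
    """Report which optional tools named in this README are available locally.

    Hypothetical helper for illustration; it only inspects the local machine.
    """
    return {
        # Blender and nvcc are looked up as executables on PATH (assumed names).
        "blender": shutil.which("blender") is not None,
        "nvcc": shutil.which("nvcc") is not None,
        # flash-attn is assumed to be importable as the module `flash_attn`.
        "flash_attn": importlib.util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for name, available in check_optional_dependencies().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

If any entry is reported as missing and you need 3D rendering or Internvl-chat-v1.5-24B, the Docker image is likely the simpler route.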
Source data is stored on HuggingFace. It includes 3d_assets, agqa_video, and object_images.
For real images with scene graphs, please download the images and scene graphs from the following links: SceneGraph, Image. After downloading, move the scene graphs and images into the source data folder and arrange them in the format below.
TaskMeAnything-v1-source/vg/sceneGraphs: move scene graphs files to this folder (e.g. TaskMeAnything-v1-source/vg/sceneGraphs/train_sceneGraphs.json).
TaskMeAnything-v1-source/vg/images/images: move all the images to this folder (e.g. TaskMeAnything-v1-source/vg/images/images/2323739.jpg).
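Before running the real-image task generators, you may want to confirm that the files landed in the layout described above. This is a minimal sketch under stated assumptions: the function name `check_vg_layout` is hypothetical, and it assumes scene graphs are `.json` files and images are `.jpg` files, as in the two example paths.

```python
from pathlib import Path


def check_vg_layout(source_root="TaskMeAnything-v1-source"):
    """Return a list of problems found in the expected vg/ folder layout.

    Hypothetical helper for illustration; an empty list means both folders
    exist and each contains at least one file of the expected type.
    """
    root = Path(source_root)
    scene_graph_dir = root / "vg" / "sceneGraphs"
    image_dir = root / "vg" / "images" / "images"

    problems = []
    # Check the directory first so a missing folder is reported explicitly.
    if not scene_graph_dir.is_dir():
        problems.append(f"missing directory: {scene_graph_dir}")
    elif not any(scene_graph_dir.glob("*.json")):
        problems.append(f"no scene graph .json files in {scene_graph_dir}")

    if not image_dir.is_dir():
        problems.append(f"missing directory: {image_dir}")
    elif not any(image_dir.glob("*.jpg")):
        problems.append(f"no .jpg images in {image_dir}")

    return problems


if __name__ == "__main__":
    issues = check_vg_layout()
    print("layout looks fine" if not issues else "\n".join(issues))
```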
We have 28 task generators in TaskMeAnything-v1, across 5 Scenarios:
2D Sticker Image: grid-how-many, grid-what, grid-where, grid-what-attribute, grid-where-attribute
3D Tabletop Image: 3d-what, 3d-where, 3d-what-attribute, 3d-where-attribute, 3d-how-many, 3d-what-size, 3d-where-size, 3d-what-attribute-size, 3d-what-distance, 3d-where-distance, 3d-what-attribute-distance
3D Tabletop Video: video-3d-what-move, video-3d-where-move, video-3d-what-attribute-move, video-3d-what-rotate, video-3d-where-rotate, video-3d-what-attribute-rotate
Real Images: sg-what-object, sg-what-relation, sg-what-attribute
Real Videos: video-sg-what-object, video-sg-what-relation, video-sg-what-action
We support the following ImageQA and VideoQA models:
ImageQA: qwenvl-chat, qwenvl, llavav1.5-7b, llavav1.5-13b, instructblip-vicuna7b, instructblip-vicuna13b, internvl-chat-v1.5, gemini-vision-pro, qwen-vl-max, gpt4v, gpt4o
VideoQA: video-llama2-7b, video-llama2-13b, video-llava-7b, chat-univi-7b, chat-univi-13b, video-chatgpt-7b, video-chat2-7b
You can also use our unified VQA interface for inference:
from PIL import Image
from tma.models.qa_model import ImageQAModel
# from tma.models.qa_model.prompt import succinct_prompt
from tma.models.qa_model.prompt import detailed_imageqa_prompt

model = ImageQAModel(
    model_name="llavav1.5-7b",  # use one of the supported ImageQA model names listed above
    prompt_name="detailed",
    prompt_func=detailed_imageqa_prompt,
)
# The image can be given as a file path...
image = './path/to/image.jpg'
# ...or as a PIL image: image = Image.open('./path/to/image.jpg')
question = "Describe the image."
model.qa(image, question)
For VideoQA model inference, check the videoqa model branch.
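To score a model over several questions, the single `qa(image, question)` call can be wrapped in a loop. The sketch below is illustrative only and is not the project's evaluation code: the `evaluate` function, the `image`/`question`/`answer` record keys, and the exact-match metric are all assumptions, and `EchoModel` is a hypothetical stand-in so the example runs without downloading any model. It also assumes `qa` returns the answer as a string.

```python
def evaluate(model, examples):
    """Return exact-match accuracy of `model` over `examples`.

    Assumes each example is a dict with hypothetical keys
    'image', 'question', and 'answer', and that model.qa returns a string.
    """
    if not examples:
        return 0.0
    correct = 0
    for example in examples:
        prediction = model.qa(example["image"], example["question"])
        # Case-insensitive exact match; an illustrative metric only.
        if str(prediction).strip().lower() == example["answer"].strip().lower():
            correct += 1
    return correct / len(examples)


class EchoModel:
    """Hypothetical stand-in that exposes the same qa(image, question) call."""

    def qa(self, image, question):
        return "cat"


examples = [
    {"image": "a.jpg", "question": "What is this?", "answer": "Cat"},
    {"image": "b.jpg", "question": "What is this?", "answer": "dog"},
]
print(evaluate(EchoModel(), examples))  # 0.5
```

Any object with a compatible `qa` method could be passed in place of `EchoModel`, which keeps the scoring logic independent of the model backend.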
Currently, we provide two versions of the TaskMeAnything-v1 benchmark, a random version and a 2024 version. The 2024 version reflects the current progress of MLMs by automatically finding tasks that popular MLMs struggle with, using the TaskMeAnything Top-K query and query approximation algorithms.
import datasets
dataset_name = 'weikaih/TaskMeAnything-v1-imageqa-random'
#dataset_name = 'weikaih/TaskMeAnything-v1-imageqa-2024'
dataset = datasets.load_dataset(dataset_name, split = TASK_GENERATOR_SPLIT)
where TASK_GENERATOR_SPLIT is one of the task generators, e.g., 2d_how_many.
import datasets
dataset_name = 'weikaih/TaskMeAnything-v1-videoqa-random'
#dataset_name = 'weikaih/TaskMeAnything-v1-videoqa-2024'
dataset = datasets.load_dataset(dataset_name, split = TASK_GENERATOR_SPLIT)
# example: convert binary stream in dataset to .mp4 files
video_binary = dataset[0]['video']
with open('/path/save/video.mp4', 'wb') as f:
    f.write(video_binary)
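The same conversion can be applied to many records at once. The following is a minimal sketch, assuming each record stores the raw video bytes under the `video` key as in the example above; the function name `save_videos` and the `video_00000.mp4` naming scheme are hypothetical choices for illustration.

```python
from pathlib import Path


def save_videos(records, output_dir, video_key="video"):
    """Write the binary video of each record to its own .mp4 file.

    `records` is any iterable of dict-like rows whose `video_key` entry holds
    raw bytes. Returns the list of written paths in input order.
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    saved = []
    for index, record in enumerate(records):
        # Zero-padded index keeps the files sorted in dataset order.
        target = output_path / f"video_{index:05d}.mp4"
        target.write_bytes(record[video_key])
        saved.append(target)
    return saved
```

For example, `save_videos(dataset, './videos')` would write one file per row of a loaded split, provided the rows follow the assumed structure.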
For more details, please check out the paper.
TaskMeAnything-DB is stored on HuggingFace.
TaskMeAnything-UI is hosted on HuggingFace. Check out our interactive interface to explore the performance of models on TaskMeAnything-v1 in your own way!
TaskMeAnything and its associated resources are provided for research and educational purposes only. The authors and contributors make no warranties regarding the accuracy or reliability of the data and software. Users are responsible for ensuring their use complies with applicable laws and regulations. The project is not liable for any damages or losses resulting from the use of these resources.
BibTeX:
@article{zhang2024task,
title={Task Me Anything},
author={Zhang, Jieyu and Huang, Weikai and Ma, Zixian and Michel, Oscar and He, Dong and Gupta, Tanmay and Ma, Wei-Chiu and Farhadi, Ali and Kembhavi, Aniruddha and Krishna, Ranjay},
journal={arXiv preprint arXiv:2406.11775},
year={2024}
}