A Datasette instance for searching through 10m videos included in the WebVid-10M training set used for the Make-A-Video model by Meta AI.
Browse and search the videos at https://webvid.datasette.io/webvid/videos
More on this project: Exploring 10m scraped Shutterstock videos used to train Meta’s Make-A-Video text-to-video model
I downloaded the CSV file (2.7 GB) like this:
wget http://www.robots.ox.ac.uk/~maxbain/webvid/results_10M_train.csv
Then I loaded it into a webvid-full.db
SQLite database using this trick like so:
sqlite3 webvid-full.db <<EOS
.mode csv
.import results_10M_train.csv videos
EOS
The CSV file turned out to have a small number of duplicates (based on video ID) - videos which were in there more than once due to being crawled with an updated caption.
I used this SQL query to identify those:
select videoid, contentUrl, duration, page_dir, name, count(*)
from videos group by videoid
having count(*) > 1
I then created a smaller webvid.db
database using a number of steps now contained in the build-webvid-db.sh file.
I published the result to Fly as a custom Docker container.