License: Unlicense
Archival JSONL dump of the exhentai metadata from the community crawl. In case you want to do some machine learning with it.
The dataset is compressed and can be found in `data/`. The current version contains only the no-CG, no-cosplay pages.
An uncompressed 100-record sample can be found in `data/sample-100.jsonl`.
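As a rough sketch of how the sample could be loaded, the snippet below reads a JSONL file one record per line. The helper name `read_jsonl` is made up for illustration, and the code assumes the file is UTF-8 encoded with one JSON object per line; it does not cover the compressed files, whose format is not described here.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == "__main__":
    sample = Path("data/sample-100.jsonl")
    if sample.exists():
        records = list(read_jsonl(sample))
        print(len(records), "records loaded")
```

Reading lazily with a generator keeps memory use flat, which matters more for the full dump than for the 100-record sample.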
Total galleries included: 576056
Total translated: 209095
Artist | Galleries |
---|---|
mizuryu kei | 837 |
nakajima yuka | 836 |
crimson | 768 |
saigado | 757 |
cle masahiro | 734 |
itaba hiroshi | 641 |
sanbun kyoden | 621 |
shiwasu no okina | 601 |
yuzuki n dash | 592 |
inochi wazuka | 572 |
Language | Galleries |
---|---|
japanese | 309041 |
english | 97687 |
chinese | 51711 |
korean | 36013 |
spanish | 26756 |
russian | 14119 |
portuguese | 8904 |
french | 8248 |
speechless | 6500 |
thai | 4714 |
Female: big breasts (162918), lolicon (114988), sole female (101170), stockings (84915), schoolgirl uniform (73813)
Male: sole male (89600), shotacon (61892), yaoi (43680), males only (32906), anal (30191)
Misc: group (89827), full color (70355), incest (40864), mosaic censorship (30386), tankoubon (28461)
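Per-namespace tag counts of this kind could be recomputed with a `Counter`. The sketch below is not the script used to produce the numbers above; it assumes each record carries a `tags` list of objects with `namespace` and `tag` keys, and that a blank namespace means misc. The `language` namespace and the two tiny records are hypothetical illustration data.

```python
from collections import Counter, defaultdict


def count_tags_by_namespace(records):
    """Return {namespace: Counter({tag: number of occurrences})}."""
    counts = defaultdict(Counter)
    for record in records:
        for t in record.get("tags", []):
            # Assumption: a blank namespace denotes a misc tag.
            ns = t.get("namespace") or "misc"
            counts[ns][t["tag"]] += 1
    return counts


# Hypothetical records, only to show the shape of the result.
records = [
    {"tags": [{"namespace": "language", "tag": "english"},
              {"namespace": "", "tag": "full color"}]},
    {"tags": [{"namespace": "language", "tag": "english"},
              {"namespace": "", "tag": "tankoubon"}]},
]

counts = count_tags_by_namespace(records)
print(counts["language"].most_common(1))  # [('english', 2)]
```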
id
: Gallery id in the form of g/*/*

thumb
: URL of gallery thumbnail

title
: Title of gallery

category
: Gallery type

uploader
: Uploader display name

created
: Gallery post time Y-m-d H:i

pages
: Number of images in gallery

tags
: List of tags

namespace
: Namespace of tag (for `misc:` this will be blank)

tag
: Tag

confident
: 1 if the tag passed the power threshold (solid border), otherwise 0

Certain files from the community dump were blacklisted because they were in the wrong format (the "Minimal" view). They are ignored because the current script cannot handle the minimal page format, and those pages carry no tag information for the galleries, which is arguably the whole point of this export.
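To make the field list concrete, here is a minimal sketch of working with one record. It rests on several assumptions that are not confirmed by the dump itself: that `tags` is a list of objects with `namespace`, `tag` and `confident` keys, that `Y-m-d H:i` is a PHP-style pattern equivalent to `%Y-%m-%d %H:%M`, and that `confident` may be stored as either a number or a string. The helper names and the example record are hypothetical.

```python
from datetime import datetime


def parse_created(value):
    """Parse the `created` field, assuming Y-m-d H:i means e.g. 2015-03-01 12:34."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M")


def confident_tags(record):
    """Return 'namespace:tag' strings for tags whose `confident` flag is 1."""
    result = []
    for t in record.get("tags", []):
        # int() tolerates the flag being stored as 1 or "1" (assumption).
        if int(t.get("confident", 0)) == 1:
            ns = t.get("namespace") or "misc"  # blank namespace means misc
            result.append(f"{ns}:{t['tag']}")
    return result


# Hypothetical record shaped after the field list; values are made up.
example = {
    "id": "g/1/abc",
    "created": "2015-03-01 12:34",
    "pages": 24,
    "tags": [
        {"namespace": "language", "tag": "english", "confident": 1},
        {"namespace": "", "tag": "full color", "confident": 0},
        {"namespace": "", "tag": "tankoubon", "confident": 1},
    ],
}

print(parse_created(example["created"]).year)  # 2015
print(confident_tags(example))  # ['language:english', 'misc:tankoubon']
```

Filtering on `confident` is one way to drop low-power tags before training; whether that is desirable depends on how much label noise the downstream model tolerates.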