wiki-entity-summarization

This repository hosts a comprehensive suite for generating graph-based entity summarization datasets from user-selected Wikipedia pages. Through a series of interconnected modules, it uses Wikidata and Wikipedia dumps to construct a dataset together with auto-generated ground truths.

CC-BY-4.0 License

Stars
21
Committers
4

wiki-entity-summarization - 1.0.5 Latest Release

Published by msorkhpar 4 months ago

We have generated datasets using the A Brief History of Human Time project.
These datasets contain different sets of seed nodes, categorized by various human arts and professions.

dataset (variant, size, None/train/val/test) formats #roots #summaries #nodes #edges #labels root category distribution running time (sec)
WikiLitArt-s csv, graphml, croissant.json 494 10416 85346 136950 547 actor=150 composer=35 film=41 novelist=24 painter=59 poet=39 screenwriter=17 singer=72 writer=57 91.934
WikiLitArt-s-train csv, graphml, croissant.json 346 7234 61885 96497 508 actor=105 composer=24 film=29 novelist=17 painter=42 poet=27 screenwriter=12 singer=50 writer=40 66.023
WikiLitArt-s-val csv, graphml, croissant.json 74 1572 14763 20795 340 actor=23 composer=5 film=6 novelist=4 painter=9 poet=6 screenwriter=2 singer=11 writer=8 14.364
WikiLitArt-s-test csv, graphml, croissant.json 74 1626 15861 22029 350 actor=22 composer=6 film=6 novelist=3 painter=8 poet=6 screenwriter=3 singer=11 writer=9 14.6
WikiLitArt-m csv, graphml, croissant.json 494 10416 128061 220263 604 actor=150 composer=35 film=41 novelist=24 painter=59 poet=39 screenwriter=17 singer=72 writer=57 155.368
WikiLitArt-m-train csv, graphml, croissant.json 346 7234 93251 155667 566 actor=105 composer=24 film=29 novelist=17 painter=42 poet=27 screenwriter=12 singer=50 writer=40 111.636
WikiLitArt-m-val csv, graphml, croissant.json 74 1572 22214 33547 375 actor=23 composer=5 film=6 novelist=4 painter=9 poet=6 screenwriter=2 singer=11 writer=8 22.957
WikiLitArt-m-test csv, graphml, croissant.json 74 1626 24130 35980 394 actor=22 composer=6 film=6 novelist=3 painter=8 poet=6 screenwriter=3 singer=11 writer=9 26.187
WikiLitArt-l csv, graphml, croissant.json 494 10416 239491 466905 703 actor=150 composer=35 film=41 novelist=24 painter=59 poet=39 screenwriter=17 singer=72 writer=57 353.113
WikiLitArt-l-train csv, graphml, croissant.json 346 7234 176057 332279 661 actor=105 composer=24 film=29 novelist=17 painter=42 poet=27 screenwriter=12 singer=50 writer=40 244.544
WikiLitArt-l-val csv, graphml, croissant.json 74 1572 42745 71734 446 actor=23 composer=5 film=6 novelist=4 painter=9 poet=6 screenwriter=2 singer=11 writer=8 57.263
WikiLitArt-l-test csv, graphml, croissant.json 74 1626 46890 77931 493 actor=22 composer=6 film=6 novelist=3 painter=8 poet=6 screenwriter=3 singer=11 writer=9 60.466
WikiCinema-s csv, graphml, croissant.json 493 11750 70753 126915 469 actor=405 film=88 118.014
WikiCinema-s-train csv, graphml, croissant.json 345 8374 52712 89306 437 actor=284 film=61 84.364
WikiCinema-s-val csv, graphml, croissant.json 73 1650 13362 19280 305 actor=59 film=14 18.651
WikiCinema-s-test csv, graphml, croissant.json 75 1744 14777 21567 313 actor=62 film=13 19.851
WikiCinema-m csv, graphml, croissant.json 493 11750 101529 196061 541 actor=405 film=88 196.413
WikiCinema-m-train csv, graphml, croissant.json 345 8374 75900 138897 491 actor=284 film=61 142.091
WikiCinema-m-val csv, graphml, croissant.json 73 1650 19674 30152 344 actor=59 film=14 31.722
WikiCinema-m-test csv, graphml, croissant.json 75 1744 22102 34499 342 actor=62 film=13 33.674
WikiCinema-l csv, graphml, croissant.json 493 11750 185098 397546 614 actor=405 film=88 475.679
WikiCinema-l-train csv, graphml, croissant.json 345 8374 139598 284417 575 actor=284 film=61 333.148
WikiCinema-l-val csv, graphml, croissant.json 73 1650 37352 63744 412 actor=59 film=14 68.62
WikiCinema-l-test csv, graphml, croissant.json 75 1744 43238 74205 426 actor=62 film=13 87.07
WikiPro-s csv, graphml, croissant.json 493 9853 79825 125912 616 actor=58 football=156 journalist=14 lawyer=16 painter=23 player=25 politician=125 singer=27 sport=21 writer=28 126.119
WikiPro-s-train csv, graphml, croissant.json 345 6832 57529 87768 575 actor=41 football=109 journalist=10 lawyer=11 painter=16 player=17 politician=87 singer=19 sport=15 writer=20 89.874
WikiPro-s-val csv, graphml, croissant.json 74 1548 15769 21351 405 actor=9 football=23 journalist=2 lawyer=3 painter=3 player=4 politician=19 singer=4 sport=3 writer=4 21.021
WikiPro-s-test csv, graphml, croissant.json 74 1484 15657 21145 384 actor=8 football=24 journalist=2 lawyer=2 painter=4 player=4 politician=19 singer=4 sport=3 writer=4 21.743
WikiPro-m csv, graphml, croissant.json 493 9853 119305 198663 670 actor=58 football=156 journalist=14 lawyer=16 painter=23 player=25 politician=125 singer=27 sport=21 writer=28 208.157
WikiPro-m-train csv, graphml, croissant.json 345 6832 86434 138676 633 actor=41 football=109 journalist=10 lawyer=11 painter=16 player=17 politician=87 singer=19 sport=15 writer=20 141.563
WikiPro-m-val csv, graphml, croissant.json 74 1548 24230 34636 463 actor=9 football=23 journalist=2 lawyer=3 painter=3 player=4 politician=19 singer=4 sport=3 writer=4 36.045
WikiPro-m-test csv, graphml, croissant.json 74 1484 24117 34157 462 actor=8 football=24 journalist=2 lawyer=2 painter=4 player=4 politician=19 singer=4 sport=3 writer=4 36.967
WikiPro-l csv, graphml, croissant.json 493 9853 230442 412766 769 actor=58 football=156 journalist=14 lawyer=16 painter=23 player=25 politician=125 singer=27 sport=21 writer=28 489.409
WikiPro-l-train csv, graphml, croissant.json 345 6832 166685 290069 725 actor=41 football=109 journalist=10 lawyer=11 painter=16 player=17 politician=87 singer=19 sport=15 writer=20 334.864
WikiPro-l-val csv, graphml, croissant.json 74 1548 48205 74387 549 actor=9 football=23 journalist=2 lawyer=3 painter=3 player=4 politician=19 singer=4 sport=3 writer=4 84.089
WikiPro-l-test csv, graphml, croissant.json 74 1484 47981 72845 546 actor=8 football=24 journalist=2 lawyer=2 painter=4 player=4 politician=19 singer=4 sport=3 writer=4 92.545
WikiProFem-s csv, graphml, croissant.json 468 8338 79926 123193 571 actor=141 athletic=25 football=24 journalist=16 painter=16 player=32 politician=81 singer=69 sport=18 writer=46 177.63
WikiProFem-s-train csv, graphml, croissant.json 330 5587 58329 87492 521 actor=98 athletic=18 football=17 journalist=9 painter=13 player=22 politician=57 singer=48 sport=14 writer=34 127.614
WikiProFem-s-val csv, graphml, croissant.json 68 1367 14148 19360 344 actor=21 athletic=4 football=3 journalist=4 painter=1 player=5 politician=13 singer=11 sport=1 writer=5 29.081
WikiProFem-s-test csv, graphml, croissant.json 70 1387 13642 18567 360 actor=22 athletic=3 football=4 journalist=3 painter=2 player=5 politician=11 singer=10 sport=3 writer=7 27.466
WikiProFem-m csv, graphml, croissant.json 468 8338 122728 196838 631 actor=141 athletic=25 football=24 journalist=16 painter=16 player=32 politician=81 singer=69 sport=18 writer=46 301.718
WikiProFem-m-train csv, graphml, croissant.json 330 5587 89922 140505 600 actor=98 athletic=18 football=17 journalist=9 painter=13 player=22 politician=57 singer=48 sport=14 writer=34 217.699
WikiProFem-m-val csv, graphml, croissant.json 68 1367 21978 31230 409 actor=21 athletic=4 football=3 journalist=4 painter=1 player=5 politician=13 singer=11 sport=1 writer=5 46.793
WikiProFem-m-test csv, graphml, croissant.json 70 1387 21305 29919 394 actor=22 athletic=3 football=4 journalist=3 painter=2 player=5 politician=11 singer=10 sport=3 writer=7 46.317
WikiProFem-l csv, graphml, croissant.json 468 8338 248012 413895 722 actor=141 athletic=25 football=24 journalist=16 painter=16 player=32 politician=81 singer=69 sport=18 writer=46 768.99
WikiProFem-l-train csv, graphml, croissant.json 330 5587 183710 297686 676 actor=98 athletic=18 football=17 journalist=9 painter=13 player=22 politician=57 singer=48 sport=14 writer=34 544.893
WikiProFem-l-val csv, graphml, croissant.json 68 1367 46018 67193 492 actor=21 athletic=4 football=3 journalist=4 painter=1 player=5 politician=13 singer=11 sport=1 writer=5 116.758
WikiProFem-l-test csv, graphml, croissant.json 70 1387 44193 63563 472 actor=22 athletic=3 football=4 journalist=3 painter=2 player=5 politician=11 singer=10 sport=3 writer=7 118.524
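The #roots column shows how the seed nodes are distributed over the splits. As a quick arithmetic check, the sketch below uses only the root counts copied from the WikiLitArt-s rows; the variable names are illustrative, and the resulting shares are an observation about the counts, not a documented split ratio.

```python
# Root counts copied from the WikiLitArt-s rows of the table.
roots = {"full": 494, "train": 346, "val": 74, "test": 74}

# The split counts add up to the root count of the unsplit graph.
assert roots["train"] + roots["val"] + roots["test"] == roots["full"]

# Share of roots in each split, rounded to three decimals.
shares = {
    name: round(roots[name] / roots["full"], 3)
    for name in ("train", "val", "test")
}
print(shares)  # {'train': 0.7, 'val': 0.15, 'test': 0.15}
```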

Dataset Parameters

Parameter Value
Min valid summary edges 5
Random walk depth length 3
Min random walk number-small 100
Min random walk number-medium 150
Min random walk number-large 300
Max random walk number-small 300
Max random walk number-medium 600
Max random walk number-large 1800
Bridges number 5
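
For scripts that need these settings, the table can be written down as a small configuration object. This is only a sketch: the class, field, and constant names below are hypothetical and are not taken from the project's code; only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalkSettings:
    """Random-walk bounds for one dataset size (hypothetical names)."""
    min_walks: int
    max_walks: int


# Values copied from the Dataset Parameters table.
MIN_VALID_SUMMARY_EDGES = 5
RANDOM_WALK_DEPTH = 3
BRIDGES_NUMBER = 5
WALKS_BY_SIZE = {
    "small": WalkSettings(min_walks=100, max_walks=300),
    "medium": WalkSettings(min_walks=150, max_walks=600),
    "large": WalkSettings(min_walks=300, max_walks=1800),
}

for size, settings in WALKS_BY_SIZE.items():
    print(size, settings.min_walks, settings.max_walks)
```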

Graph Structure

In the following you can see a sample of the graph format (we highly recommend using our toolkit to load the datasets):

CSV Format

After unzipping the {variant}-{size}-{dataset_type}.zip file, you will find the following CSV files:

{variant}-{size}-{dataset_type}-entities.csv contains entities. An entity is a Wikidata item (node) in our
dataset.

Field Description datatype
id incremental integer starting from zero int
entity Wikidata qid, e.g. Q76 string
wikidata_label Wikidata label (nullable) string
wikidata_desc Wikidata description (nullable) string
wikipedia_title Wikipedia title (nullable) string
wikipedia_id Wikipedia page id (nullable) long

{variant}-{size}-{dataset_type}-root-entities.csv contains root entities. A root entity is a seed node
described previously.

Field Description datatype
entity id key in {variant}-{size}-{dataset_type}-entities.csv int
category category string

{variant}-{size}-{dataset_type}-predicates.csv contains predicates. A predicate is a Wikidata property
describing a connection.

Field Description datatype
id incremental integer starting from zero int
predicate Wikidata Property id, e.g. P121 string
predicate_label Wikidata Property label (nullable) string
predicate_desc Wikidata Property description (nullable) string

{variant}-{size}-{dataset_type}-triples.csv contains triples. A triple is an edge between two entities with a
predicate.

Field Description datatype
subject id key in {variant}-{size}-{dataset_type}-entities.csv int
predicate id key in {variant}-{size}-{dataset_type}-predicates.csv int
object id key in {variant}-{size}-{dataset_type}-entities.csv int

{variant}-{size}-{dataset_type}-ground-truths.csv contains ground truth triples. A ground truth triple is an
edge that is marked as a summary for a root entity.

Field Description datatype
root_entity entity in {variant}-{size}-{dataset_type}-root-entities.csv int
subject id key in {variant}-{size}-{dataset_type}-entities.csv int
predicate id key in {variant}-{size}-{dataset_type}-predicates.csv int
object id key in {variant}-{size}-{dataset_type}-entities.csv int

Note: for this file one of the columns subject or object is equal to the root_entity.
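
Because the tables are linked by integer ids, reading a triple back as Wikidata identifiers takes two lookups. The sketch below uses Python's standard csv module with inline sample rows (a subset of the columns) standing in for the real files. It is not the project's toolkit, and the helper name `index_by_id` is made up for illustration.

```python
import csv
import io

# Inline stand-ins for the entities, predicates and triples CSV files.
ENTITIES = "id,entity,wikidata_label\n0,Q43416,Keanu Reeves\n1,Q3820,Beirut\n"
PREDICATES = "id,predicate,predicate_label\n0,P19,place of birth\n"
TRIPLES = "subject,predicate,object\n0,0,1\n"


def index_by_id(text, value_column):
    """Map the integer `id` column to the chosen column (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["id"]): row[value_column] for row in reader}


entity_qid = index_by_id(ENTITIES, "entity")
predicate_pid = index_by_id(PREDICATES, "predicate")

# Replace each integer key with the Wikidata identifier it points to.
resolved = [
    (
        entity_qid[int(t["subject"])],
        predicate_pid[int(t["predicate"])],
        entity_qid[int(t["object"])],
    )
    for t in csv.DictReader(io.StringIO(TRIPLES))
]
print(resolved)  # [('Q43416', 'P19', 'Q3820')]
```

With the real files, the `io.StringIO` objects would be replaced by files opened with `newline=""` and `encoding="utf-8"`.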

Example of CSV Files

# entities.csv
id,entity,wikidata_label,wikidata_desc,wikipedia_title,wikipedia_id
0,Q43416,Keanu Reeves,Canadian actor (born 1964),Keanu_Reeves,16603
1,Q3820,Beirut,capital and largest city of Lebanon,Beirut,37428
2,Q639669,musician,"person who composes, conducts or performs music",Musician,38284
3,Q219150,Constantine,2005 film directed by Francis Lawrence,Constantine_(film),1210303
# root-entities.csv
entity,category
0,actor
# predicates.csv
id,predicate,predicate_label,predicate_desc
0,P19,place of birth,location where the subject was born
1,P106,occupation,"occupation of a person; see also ""field of work"" (Property:P101), ""position held"" (Property:P39)"
2,P161,cast member,"actor in the subject production [use ""character role"" (P453) and/or ""name of the character role"" (P4633) as qualifiers] [use ""voice actor"" (P725) for voice-only role]"
# triples.csv
subject,predicate,object
0,0,1
0,1,2
3,2,0
# ground-truths.csv
root_entity,subject,predicate,object
0,0,0,1
0,3,2,0
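
The rule that every ground-truth row has the root entity as its subject or its object can be turned into a quick sanity check. A minimal sketch with made-up rows in the same column layout; the function name is hypothetical.

```python
import csv
import io


def violates_root_rule(row):
    """Return True when neither subject nor object equals root_entity."""
    return row["root_entity"] not in (row["subject"], row["object"])


# Made-up rows: the first two follow the rule, the last one does not.
SAMPLE = "root_entity,subject,predicate,object\n0,0,0,1\n0,3,2,0\n0,1,2,3\n"

bad_rows = [
    row for row in csv.DictReader(io.StringIO(SAMPLE)) if violates_root_rule(row)
]
print(len(bad_rows))  # 1
```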

GraphML Example

The same graph can be represented in GraphML format.

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
    <key id="d9" for="edge" attr.name="summary_for" attr.type="string"/>
    <key id="d8" for="edge" attr.name="predicate_desc" attr.type="string"/>
    <key id="d7" for="edge" attr.name="predicate_label" attr.type="string"/>
    <key id="d6" for="edge" attr.name="predicate" attr.type="string"/>
    <key id="d5" for="node" attr.name="category" attr.type="string"/>
    <key id="d4" for="node" attr.name="is_root" attr.type="boolean"/>
    <key id="d3" for="node" attr.name="wikidata_desc" attr.type="string"/>
    <key id="d2" for="node" attr.name="wikipedia_title" attr.type="string"/>
    <key id="d1" for="node" attr.name="wikipedia_id" attr.type="long"/>
    <key id="d0" for="node" attr.name="wikidata_label" attr.type="string"/>
    <graph edgedefault="directed">
        <node id="Q43416">
            <data key="d0">Keanu Reeves</data>
            <data key="d1">16603</data>
            <data key="d2">Keanu_Reeves</data>
            <data key="d3">Canadian actor (born 1964)</data>
            <data key="d4">True</data>
            <data key="d5">actor</data>
        </node>
        <node id="Q3820">
            <data key="d0">Beirut</data>
            <data key="d1">37428</data>
            <data key="d2">Beirut</data>
            <data key="d3">capital and largest city of Lebanon</data>
        </node>
        <node id="Q639669">
            <data key="d0">musician</data>
            <data key="d1">38284</data>
            <data key="d2">Musician</data>
            <data key="d3">person who composes, conducts or performs music</data>
        </node>
        <node id="Q219150">
            <data key="d0">Constantine</data>
            <data key="d1">1210303</data>
            <data key="d2">Constantine_(film)</data>
            <data key="d3">2005 film directed by Francis Lawrence</data>
        </node>
        <edge source="Q43416" target="Q3820" id="P19">
            <data key="d6">P19</data>
            <data key="d7">place of birth</data>
            <data key="d8">location where the subject was born</data>
            <data key="d9">Q43416</data>
        </edge>
        <edge source="Q43416" target="Q639669" id="P106">
            <data key="d6">P106</data>
            <data key="d7">occupation</data>
            <data key="d8">occupation of a person; see also "field of work" (Property:P101), "position held"
                (Property:P39)
            </data>
        </edge>
        <edge source="Q219150" target="Q43416" id="P161">
            <data key="d6">P161</data>
            <data key="d7">cast member</data>
            <data key="d8">actor in the subject production [use "character role" (P453) and/or "name of the character
                role" (P4633) as qualifiers] [use "voice actor" (P725) for voice-only role]
            </data>
            <data key="d9">Q43416</data>
        </edge>
    </graph>
</graphml>
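
To see how the key declarations map onto edge data, the GraphML can be read with Python's standard xml.etree.ElementTree module. The sketch below parses a trimmed copy of the example and lists the edges that carry a summary_for value. It is an illustration only, not the project's loader.

```python
import xml.etree.ElementTree as ET

NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Trimmed copy of the example: two edge keys, three nodes, two edges.
GRAPHML = """<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
    <key id="d9" for="edge" attr.name="summary_for" attr.type="string"/>
    <key id="d6" for="edge" attr.name="predicate" attr.type="string"/>
    <graph edgedefault="directed">
        <node id="Q43416"/>
        <node id="Q3820"/>
        <node id="Q639669"/>
        <edge source="Q43416" target="Q3820">
            <data key="d6">P19</data>
            <data key="d9">Q43416</data>
        </edge>
        <edge source="Q43416" target="Q639669">
            <data key="d6">P106</data>
        </edge>
    </graph>
</graphml>"""

root = ET.fromstring(GRAPHML.encode("utf-8"))

# Map key ids such as "d9" to attribute names such as "summary_for".
key_names = {k.get("id"): k.get("attr.name") for k in root.findall("g:key", NS)}

summary_edges = []
for edge in root.findall("g:graph/g:edge", NS):
    data = {key_names[d.get("key")]: d.text for d in edge.findall("g:data", NS)}
    if "summary_for" in data:  # only ground-truth edges carry this key
        summary_edges.append(
            (edge.get("source"), data["predicate"], edge.get("target"), data["summary_for"])
        )

print(summary_edges)  # [('Q43416', 'P19', 'Q3820', 'Q43416')]
```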

wiki-entity-summarization - 1.0.4

Published by msorkhpar 4 months ago

We have generated datasets using the A Brief History of Human Time project.
These datasets contain different sets of seed nodes, categorized by various human arts and professions.

dataset (variant, size, None/train/val/test) formats #roots #summaries #nodes #edges #labels root category distribution running time (sec)
WikiLitArt-s-train csv, graphml, croissant.json 346 7234 61885 96497 508 actor=105 composer=24 film=29 novelist=17 painter=42 poet=27 screenwriter=12 singer=50 writer=40 66.023
WikiLitArt-s-val csv, graphml, croissant.json 74 1572 14763 20795 340 actor=23 composer=5 film=6 novelist=4 painter=9 poet=6 screenwriter=2 singer=11 writer=8 14.364
WikiLitArt-s-test csv, graphml, croissant.json 74 1626 15861 22029 350 actor=22 composer=6 film=6 novelist=3 painter=8 poet=6 screenwriter=3 singer=11 writer=9 14.6
WikiLitArt-m csv, graphml, croissant.json 494 10416 128061 220263 604 actor=150 composer=35 film=41 novelist=24 painter=59 poet=39 screenwriter=17 singer=72 writer=57 155.368
WikiLitArt-m-train csv, graphml, croissant.json 346 7234 93251 155667 566 actor=105 composer=24 film=29 novelist=17 painter=42 poet=27 screenwriter=12 singer=50 writer=40 111.636
WikiLitArt-m-val csv, graphml, croissant.json 74 1572 22214 33547 375 actor=23 composer=5 film=6 novelist=4 painter=9 poet=6 screenwriter=2 singer=11 writer=8 22.957
WikiLitArt-m-test csv, graphml, croissant.json 74 1626 24130 35980 394 actor=22 composer=6 film=6 novelist=3 painter=8 poet=6 screenwriter=3 singer=11 writer=9 26.187
WikiLitArt-l csv, graphml, croissant.json 494 10416 239491 466905 703 actor=150 composer=35 film=41 novelist=24 painter=59 poet=39 screenwriter=17 singer=72 writer=57 353.113
WikiLitArt-l-train csv, graphml, croissant.json 346 7234 176057 332279 661 actor=105 composer=24 film=29 novelist=17 painter=42 poet=27 screenwriter=12 singer=50 writer=40 244.544
WikiLitArt-l-val csv, graphml, croissant.json 74 1572 42745 71734 446 actor=23 composer=5 film=6 novelist=4 painter=9 poet=6 screenwriter=2 singer=11 writer=8 57.263
WikiLitArt-l-test csv, graphml, croissant.json 74 1626 46890 77931 493 actor=22 composer=6 film=6 novelist=3 painter=8 poet=6 screenwriter=3 singer=11 writer=9 60.466
WikiCinema-s csv, graphml, croissant.json 493 11750 70753 126915 469 actor=405 film=88 118.014
WikiCinema-s-train csv, graphml, croissant.json 345 8374 52712 89306 437 actor=284 film=61 84.364
WikiCinema-s-val csv, graphml, croissant.json 73 1650 13362 19280 305 actor=59 film=14 18.651
WikiCinema-s-test csv, graphml, croissant.json 75 1744 14777 21567 313 actor=62 film=13 19.851
WikiCinema-m csv, graphml, croissant.json 493 11750 101529 196061 541 actor=405 film=88 196.413
WikiCinema-m-train csv, graphml, croissant.json 345 8374 75900 138897 491 actor=284 film=61 142.091
WikiCinema-m-val csv, graphml, croissant.json 73 1650 19674 30152 344 actor=59 film=14 31.722
WikiCinema-m-test csv, graphml, croissant.json 75 1744 22102 34499 342 actor=62 film=13 33.674
WikiCinema-l csv, graphml, croissant.json 493 11750 185098 397546 614 actor=405 film=88 475.679
WikiCinema-l-train csv, graphml, croissant.json 345 8374 139598 284417 575 actor=284 film=61 333.148
WikiCinema-l-val csv, graphml, croissant.json 73 1650 37352 63744 412 actor=59 film=14 68.62
WikiCinema-l-test csv, graphml, croissant.json 75 1744 43238 74205 426 actor=62 film=13 87.07
WikiPro-s csv, graphml, croissant.json 493 9853 79825 125912 616 actor=58 football=156 journalist=14 lawyer=16 painter=23 player=25 politician=125 singer=27 sport=21 writer=28 126.119
WikiPro-s-train csv, graphml, croissant.json 345 6832 57529 87768 575 actor=41 football=109 journalist=10 lawyer=11 painter=16 player=17 politician=87 singer=19 sport=15 writer=20 89.874
WikiPro-s-val csv, graphml, croissant.json 74 1548 15769 21351 405 actor=9 football=23 journalist=2 lawyer=3 painter=3 player=4 politician=19 singer=4 sport=3 writer=4 21.021
WikiPro-s-test csv, graphml, croissant.json 74 1484 15657 21145 384 actor=8 football=24 journalist=2 lawyer=2 painter=4 player=4 politician=19 singer=4 sport=3 writer=4 21.743
WikiPro-m csv, graphml, croissant.json 493 9853 119305 198663 670 actor=58 football=156 journalist=14 lawyer=16 painter=23 player=25 politician=125 singer=27 sport=21 writer=28 208.157
WikiPro-m-train csv, graphml, croissant.json 345 6832 86434 138676 633 actor=41 football=109 journalist=10 lawyer=11 painter=16 player=17 politician=87 singer=19 sport=15 writer=20 141.563
WikiPro-m-val csv, graphml, croissant.json 74 1548 24230 34636 463 actor=9 football=23 journalist=2 lawyer=3 painter=3 player=4 politician=19 singer=4 sport=3 writer=4 36.045
WikiPro-m-test csv, graphml, croissant.json 74 1484 24117 34157 462 actor=8 football=24 journalist=2 lawyer=2 painter=4 player=4 politician=19 singer=4 sport=3 writer=4 36.967
WikiPro-l csv, graphml, croissant.json 493 9853 230442 412766 769 actor=58 football=156 journalist=14 lawyer=16 painter=23 player=25 politician=125 singer=27 sport=21 writer=28 489.409
WikiPro-l-train csv, graphml, croissant.json 345 6832 166685 290069 725 actor=41 football=109 journalist=10 lawyer=11 painter=16 player=17 politician=87 singer=19 sport=15 writer=20 334.864
WikiPro-l-val csv, graphml, croissant.json 74 1548 48205 74387 549 actor=9 football=23 journalist=2 lawyer=3 painter=3 player=4 politician=19 singer=4 sport=3 writer=4 84.089
WikiPro-l-test csv, graphml, croissant.json 74 1484 47981 72845 546 actor=8 football=24 journalist=2 lawyer=2 painter=4 player=4 politician=19 singer=4 sport=3 writer=4 92.545
WikiProFem-s csv, graphml, croissant.json 468 8338 79926 123193 571 actor=141 athletic=25 football=24 journalist=16 painter=16 player=32 politician=81 singer=69 sport=18 writer=46 177.63
WikiProFem-s-train csv, graphml, croissant.json 330 5587 58329 87492 521 actor=98 athletic=18 football=17 journalist=9 painter=13 player=22 politician=57 singer=48 sport=14 writer=34 127.614
WikiProFem-s-val csv, graphml, croissant.json 68 1367 14148 19360 344 actor=21 athletic=4 football=3 journalist=4 painter=1 player=5 politician=13 singer=11 sport=1 writer=5 29.081
WikiProFem-s-test csv, graphml, croissant.json 70 1387 13642 18567 360 actor=22 athletic=3 football=4 journalist=3 painter=2 player=5 politician=11 singer=10 sport=3 writer=7 27.466
WikiProFem-m csv, graphml, croissant.json 468 8338 122728 196838 631 actor=141 athletic=25 football=24 journalist=16 painter=16 player=32 politician=81 singer=69 sport=18 writer=46 301.718
WikiProFem-m-train csv, graphml, croissant.json 330 5587 89922 140505 600 actor=98 athletic=18 football=17 journalist=9 painter=13 player=22 politician=57 singer=48 sport=14 writer=34 217.699
WikiProFem-m-val csv, graphml, croissant.json 68 1367 21978 31230 409 actor=21 athletic=4 football=3 journalist=4 painter=1 player=5 politician=13 singer=11 sport=1 writer=5 46.793
WikiProFem-m-test csv, graphml, croissant.json 70 1387 21305 29919 394 actor=22 athletic=3 football=4 journalist=3 painter=2 player=5 politician=11 singer=10 sport=3 writer=7 46.317
WikiProFem-l csv, graphml, croissant.json 468 8338 248012 413895 722 actor=141 athletic=25 football=24 journalist=16 painter=16 player=32 politician=81 singer=69 sport=18 writer=46 768.99
WikiProFem-l-train csv, graphml, croissant.json 330 5587 183710 297686 676 actor=98 athletic=18 football=17 journalist=9 painter=13 player=22 politician=57 singer=48 sport=14 writer=34 544.893
WikiProFem-l-val csv, graphml, croissant.json 68 1367 46018 67193 492 actor=21 athletic=4 football=3 journalist=4 painter=1 player=5 politician=13 singer=11 sport=1 writer=5 116.758
WikiProFem-l-test csv, graphml, croissant.json 70 1387 44193 63563 472 actor=22 athletic=3 football=4 journalist=3 painter=2 player=5 politician=11 singer=10 sport=3 writer=7 118.524

Dataset Parameters

Parameter Value
Min valid summary edges 5
Random walk depth length 3
Min random walk number-small 100
Min random walk number-medium 150
Min random walk number-large 300
Max random walk number-small 300
Max random walk number-medium 600
Max random walk number-large 1800
Bridges number 5

Graph Structure

In the following, you can see a sample of the graph format (we highly recommend using our toolkit to load the datasets):

CSV Format

After unzipping the {variant_index}_{size}_{dataset_type}.zip file, you will find the following CSV files:

{variant_index}_{size}_{dataset_type}__entities.csv contains entities. An entity is a Wikidata item (node) in our
dataset.

Field Description datatype
id incremental integer starting from zero int
entity Wikidata qid, e.g. Q76 string
wikidata_label Wikidata label (nullable) string
wikidata_desc Wikidata description (nullable) string
wikipedia_title Wikipedia title (nullable) string
wikipedia_id Wikipedia page id (nullable) long

{variant_index}_{size}_{dataset_type}__root_entities.csv contains root entities. A root entity is a seed node
described previously.

Field Description datatype
entity id key in {variant_index}_{size}_{dataset_type}__entities.csv int
category category string

{variant_index}_{size}_{dataset_type}__predicates.csv contains predicates. A predicate is a Wikidata property
describing a connection.

Field Description datatype
id incremental integer starting from zero int
predicate Wikidata Property id, e.g. P121 string
predicate_label Wikidata Property label (nullable) string
predicate_desc Wikidata Property description (nullable) string

{variant_index}_{size}_{dataset_type}__triples.csv contains triples. A triple is an edge between two entities with a
predicate.

Field Description datatype
subject id key in {variant_index}_{size}_{dataset_type}__entities.csv int
predicate id key in {variant_index}_{size}_{dataset_type}__predicates.csv int
object id key in {variant_index}_{size}_{dataset_type}__entities.csv int

{variant_index}_{size}_{dataset_type}__ground_truths.csv contains ground truth triples. A ground truth triple is an
edge that is marked as a summary for a root entity.

Field Description datatype
root_entity entity in {variant_index}_{size}_{dataset_type}__root_entities.csv int
subject id key in {variant_index}_{size}_{dataset_type}__entities.csv int
predicate id key in {variant_index}_{size}_{dataset_type}__predicates.csv int
object id key in {variant_index}_{size}_{dataset_type}__entities.csv int

Note: for this file one of the columns subject or object is equal to the root_entity.
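
When checking an unzipped 1.0.4 archive, the expected file names can be generated from the templates listed above. The sketch below only fills in those templates. The function name is hypothetical, and the argument values are placeholders, since the concrete values of variant_index, size, and dataset_type are not spelled out here.

```python
# File-name templates copied from the list above (1.0.4 naming).
TEMPLATES = [
    "{variant_index}_{size}_{dataset_type}__entities.csv",
    "{variant_index}_{size}_{dataset_type}__root_entities.csv",
    "{variant_index}_{size}_{dataset_type}__predicates.csv",
    "{variant_index}_{size}_{dataset_type}__triples.csv",
    "{variant_index}_{size}_{dataset_type}__ground_truths.csv",
]


def expected_files(variant_index, size, dataset_type):
    """Fill every template with the given values (hypothetical helper)."""
    return [
        t.format(variant_index=variant_index, size=size, dataset_type=dataset_type)
        for t in TEMPLATES
    ]


# "X", "Y" and "Z" are placeholders, not real variant, size or split values.
names = expected_files("X", "Y", "Z")
print(names[0])  # X_Y_Z__entities.csv
```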

Example of CSV Files

# entities.csv
id,entity,wikidata_label,wikidata_desc,wikipedia_title,wikipedia_id
0,Q43416,Keanu Reeves,Canadian actor (born 1964),Keanu_Reeves,16603
1,Q3820,Beirut,capital and largest city of Lebanon,Beirut,37428
2,Q639669,musician,"person who composes, conducts or performs music",Musician,38284
3,Q219150,Constantine,2005 film directed by Francis Lawrence,Constantine_(film),1210303
# root_entities.csv
entity,category
0,actor
# predicates.csv
id,predicate,predicate_label,predicate_desc
0,P19,place of birth,location where the subject was born
1,P106,occupation,"occupation of a person; see also ""field of work"" (Property:P101), ""position held"" (Property:P39)"
2,P161,cast member,"actor in the subject production [use ""character role"" (P453) and/or ""name of the character role"" (P4633) as qualifiers] [use ""voice actor"" (P725) for voice-only role]"
# triples.csv
subject,predicate,object
0,0,1
0,1,2
3,2,0
# ground_truths.csv
root_entity,subject,predicate,object
0,0,0,1
0,3,2,0

GraphML Example

The same graph can be represented in GraphML format.

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
    <key id="d9" for="edge" attr.name="summary_for" attr.type="string"/>
    <key id="d8" for="edge" attr.name="predicate_desc" attr.type="string"/>
    <key id="d7" for="edge" attr.name="predicate_label" attr.type="string"/>
    <key id="d6" for="edge" attr.name="predicate" attr.type="string"/>
    <key id="d5" for="node" attr.name="category" attr.type="string"/>
    <key id="d4" for="node" attr.name="is_root" attr.type="boolean"/>
    <key id="d3" for="node" attr.name="wikidata_desc" attr.type="string"/>
    <key id="d2" for="node" attr.name="wikipedia_title" attr.type="string"/>
    <key id="d1" for="node" attr.name="wikipedia_id" attr.type="long"/>
    <key id="d0" for="node" attr.name="wikidata_label" attr.type="string"/>
    <graph edgedefault="directed">
        <node id="Q43416">
            <data key="d0">Keanu Reeves</data>
            <data key="d1">16603</data>
            <data key="d2">Keanu_Reeves</data>
            <data key="d3">Canadian actor (born 1964)</data>
            <data key="d4">True</data>
            <data key="d5">actor</data>
        </node>
        <node id="Q3820">
            <data key="d0">Beirut</data>
            <data key="d1">37428</data>
            <data key="d2">Beirut</data>
            <data key="d3">capital and largest city of Lebanon</data>
        </node>
        <node id="Q639669">
            <data key="d0">musician</data>
            <data key="d1">38284</data>
            <data key="d2">Musician</data>
            <data key="d3">person who composes, conducts or performs music</data>
        </node>
        <node id="Q219150">
            <data key="d0">Constantine</data>
            <data key="d1">1210303</data>
            <data key="d2">Constantine_(film)</data>
            <data key="d3">2005 film directed by Francis Lawrence</data>
        </node>
        <edge source="Q43416" target="Q3820" id="P19">
            <data key="d6">P19</data>
            <data key="d7">place of birth</data>
            <data key="d8">location where the subject was born</data>
            <data key="d9">Q43416</data>
        </edge>
        <edge source="Q43416" target="Q639669" id="P106">
            <data key="d6">P106</data>
            <data key="d7">occupation</data>
            <data key="d8">occupation of a person; see also "field of work" (Property:P101), "position held"
                (Property:P39)
            </data>
        </edge>
        <edge source="Q219150" target="Q43416" id="P161">
            <data key="d6">P161</data>
            <data key="d7">cast member</data>
            <data key="d8">actor in the subject production [use "character role" (P453) and/or "name of the character
                role" (P4633) as qualifiers] [use "voice actor" (P725) for voice-only role]
            </data>
            <data key="d9">Q43416</data>
        </edge>
    </graph>
</graphml>

wiki-entity-summarization - 1.0.3

Published by msorkhpar 5 months ago

We're excited to release the first version of our dataset, featuring:

Four Seed Node Sets: Each is available in small, medium, and large sizes and covers human arts, cinema, and famous influencers in human history.
Structured Splits: Each set is divided into train, validation, and test graphs.
48 graphs in total
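
The total can be reproduced by enumerating the combinations. The sketch below takes the variant names and size suffixes from the dataset tables of the later releases and assumes that the count includes the unsplit graph alongside the three splits (4 sets x 3 sizes x 4 graphs).

```python
from itertools import product

# Variant names and size suffixes as they appear in the dataset tables.
variants = ["WikiLitArt", "WikiCinema", "WikiPro", "WikiProFem"]
sizes = ["s", "m", "l"]
# Assumption: the total counts the unsplit graph ("") next to the three splits.
splits = ["", "-train", "-val", "-test"]

graph_names = [
    f"{variant}-{size}{split}"
    for variant, size, split in product(variants, sizes, splits)
]
print(len(graph_names))  # 48
```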

wiki-entity-summarization - v0.1-alpha

Published by msorkhpar 5 months ago

First RC version
