This is a project to analyze and add identifiers for words from Folkets Lexicon to Wikidata. A unique UUID-based identifier have been added to facilitate linking with Wikidata.
An API has been added for easy lookup. The goal is to upload it to Toolforge to facilitate integration of the data into Wikidata via a property since this might be a good freely licensed source for Swedish and English lexemes to use.
The output jsonl is versioned v1, v2, etc. The latest number is always consistent with the model in the code, old versions are kept for backwards compatibility. For each new version, completely random ids are generated.
Number of words: 39407 Number of words with sound file: 21947 Number of words missing a lexical category: 7441 Number of words with a lexical category: 31966 Number of words with a lexical category and sound file: 21280 Number of idioms: 1944 Number of examples: 12574
The source code is licensed under GPLv3+ The data directory files are CC-BY-SA 2.5
Original source is http://folkets-lexikon.csc.kth.se/folkets/ The data was downloaded during may 2023. No updates to the dump seem to happen over time since the project is not maintained anymore.
Run the api:
$ uvicorn api:app --reload
Build pack:
Make sure pack-cli-bin and docker is installed from AUR:
$ pack build --builder tools-harbor.wmcloud.org/toolforge/heroku-builder:22 myimage
$ docker run -e PORT=8000 -p 8000:8000 --rm --entrypoint web myimage
# navigate to http://127.0.0.1:8000 to check that it works