The purpose of this script is to look up, in Wikidata, the senses of all the words in an SRT file.
GPL-3.0 License
This tool helps language mentors get an overview of all lexeme forms found in sentences or SRT files.
It uses spaCy to tokenize each subtitle in the SRT file and looks up the corresponding lexeme in Wikidata based on the detected part of speech and the token representation.
The CLI script encourages the user to contribute to Wikidata when senses are completely missing on the matched lexemes.
The API currently does not check whether senses exist in Wikidata; it simply cleans, matches, and outputs the result.
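To illustrate the matching step, here is a minimal sketch of how a token and its spaCy POS tag could be matched against Wikidata lexeme forms via SPARQL. This is not the tool's actual query or mapping; the POS-to-lexical-category table below is a hypothetical subset for illustration.

```python
# Hypothetical mapping from spaCy POS tags to Wikidata lexical-category QIDs.
POS_TO_QID = {
    "NOUN": "Q1084",   # noun
    "VERB": "Q24905",  # verb
    "DET": "Q576271",  # determiner
}

def build_form_query(token: str, spacy_pos: str, lang: str = "en") -> str:
    """Build a SPARQL query matching lexeme forms by representation and category."""
    category = POS_TO_QID[spacy_pos]
    return f"""
SELECT ?form WHERE {{
  ?lexeme wikibase:lexicalCategory wd:{category} ;
          ontolex:lexicalForm ?form .
  ?form ontolex:representation "{token}"@{lang} .
}}
""".strip()

print(build_form_query("test", "NOUN"))
```

Running such a query against the Wikidata Query Service would return form IDs like L220909-F1 for the noun "test".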
poetry install
poetry run python3 -m spacy download en_core_web_sm
Other models for different languages are listed at https://spacy.io/models
python cli.py -i path-to-srt.srt --lang en --spacy_model en_core_web_sm
You can fiddle with the configuration options in config.py
An API built with FastAPI has been implemented.
It supports two fields sent via an HTTP POST request: spacy_model and sentence.
After installing Uvicorn, you can start the API in debug mode:
uvicorn api:app --reload
Test it with:
curl -X POST -H "Content-Type: application/json" http://localhost:8000/process_sentence \
-d '{"spacy_model": "en_core_web_sm", "sentence": "This is a test sentence."}'
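The same request can be made from Python. A minimal sketch using only the standard library, assuming the API is running locally on port 8000:

```python
import json
import urllib.request

def build_payload(sentence: str, spacy_model: str = "en_core_web_sm") -> bytes:
    """Encode the two supported fields as a JSON request body."""
    return json.dumps({"spacy_model": spacy_model, "sentence": sentence}).encode()

def process_sentence(sentence: str,
                     url: str = "http://localhost:8000/process_sentence") -> dict:
    """POST a sentence to the API and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=build_payload(sentence),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

if __name__ == "__main__":
    # Requires the API to be running: uvicorn api:app --reload
    result = process_sentence("This is a test sentence.")
    print(result["data"])
```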
It should output something like:
{
"data": [
{
"token": "This",
"spacy_pos": "PRON",
"matched_forms": [
"L643260-F1"
]
},
{
"token": "is",
"spacy_pos": "AUX",
"matched_forms": [
"L1883-F4"
]
},
{
"token": "a",
"spacy_pos": "DET",
"matched_forms": [
"L2767-F1"
]
},
{
"token": "test",
"spacy_pos": "NOUN",
"matched_forms": [
"L220909-F1"
]
},
{
"token": "sentence",
"spacy_pos": "NOUN",
"matched_forms": [
"L6117-F1"
]
},
{
"token": ".",
"spacy_pos": "ADJ",
"matched_forms": []
}
]
}
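Tokens with an empty matched_forms list (like the final "." above) had no matching lexeme form. A short sketch filtering them out of a response, using an abbreviated version of the sample output above:

```python
import json

# Abbreviated sample response from the API (see the full example above).
response_text = """
{
  "data": [
    {"token": "test", "spacy_pos": "NOUN", "matched_forms": ["L220909-F1"]},
    {"token": "sentence", "spacy_pos": "NOUN", "matched_forms": ["L6117-F1"]},
    {"token": ".", "spacy_pos": "ADJ", "matched_forms": []}
  ]
}
"""

def unmatched_tokens(response: dict) -> list[str]:
    """Return tokens for which no lexeme form was matched."""
    return [entry["token"] for entry in response["data"]
            if not entry["matched_forms"]]

print(unmatched_tokens(json.loads(response_text)))  # → ['.']
```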
GPLv3+, with the exception of the code borrowed from Ordia; see the license in that file.
Big thanks to Finn Nielsen for writing the spaCy->lexeme function. I improved it a bit for my purposes.