Author: Pablo Estrada < pablo (at) snu (dot) ac (dot) kr >
This repo contains the code resulting from the Hanja Graph Project, developed by Pablo Estrada as a side project.
###Folders
graph.graphml
- This contains the full graph, with links between hanja
hanja_list.json
- This contains the list of hanjas as returned by the
words.nospace.json
- This contains the list of korean words, as
korean_unip_projection.graphml
- This file contains the projection of

These are the utilities to scrape the Kanji information in 'http://www.manythings.org/kanji/d/'. They all serve different purposes.
scrape_kanji.py
- This is the main scraper. It gets the data and outputs a JSON file with
make_kanji_graph.py
- This takes the JSON output from scrape.py, and makes it into a Graphml

Not yet available : )
To generate the synonyms training set we need to follow these steps:
(1) Use the graph dataset to obtain the features of each node pair
$> nohup ./bin/generate_csv_p.py data/hanja_unip.graphml res.csv 4
(2) Obtain the 'zeros' in the training set. We do this through random sampling from the main CSV file
$> shuf -n 1000 data/res.csv > data/training_zeros
$> ./bin/removeFirstColumns data/training_zeros data/training_non_related.csv
(3) Obtain the 'synonyms' in the training set
* Obtain a random set of hanjas from the res.csv file
$> shuf -n 1000 data/res.csv | awk -F "," '{print $3}' > data/tmp
$> cat data/tmp | sort | uniq > data/random_hanjas.txt
$> ./bin/scrapeSynonyms data/random_hanjas.txt data/antonyms_hanja.txt data/synonyms_hanja.txt
$> ./bin/extractPairsFromCsv data/synonyms_hanja.txt data/res.csv data/synonyms.csv
(4) Use the result to run a classification scheme ; )
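
For readers without `shuf` or `awk` at hand, the sampling in steps (2) and (3) can be approximated in Python. This is only a sketch and not part of the repo: the function names `sample_rows` and `unique_column` are hypothetical, and the demo rows are placeholders, since the real column layout of res.csv is not documented here beyond the third column holding a hanja (as implied by the `awk` command above).

```python
import csv
import io
import random


def sample_rows(rows, n, seed=None):
    """Pick up to n rows uniformly at random, without replacement.

    Mirrors `shuf -n N`: if fewer than n rows exist, all of them are returned.
    """
    rows = list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def unique_column(rows, index):
    """Return the sorted, de-duplicated values of one column.

    Mirrors `awk -F "," '{print $K}' | sort | uniq` with index = K - 1.
    Python sorts by code point, which may differ from a locale-aware `sort`.
    """
    return sorted({row[index] for row in rows if len(row) > index})


# Placeholder data standing in for res.csv (assumed layout, not the real one).
demo_csv = "0,a,H2,x\n1,b,H1,y\n2,c,H2,z\n"
demo_rows = list(csv.reader(io.StringIO(demo_csv)))

zeros = sample_rows(demo_rows, 2, seed=0)                            # step (2): random 'zeros'
hanjas = unique_column(sample_rows(demo_rows, 3, seed=0), index=2)   # step (3): random hanjas
```

Passing a fixed `seed` makes the sample reproducible, which `shuf` does not do by default.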
(1) Run the classification script
$> ./bin/get_synonyms.py data/res.csv data/synonyms_training.csv data/training_non_related.csv data/guess_syn1.txt
(2) Verify the results
$> ./bin/checkSynonyms data/guess_syn1.txt
(3) Verify the data by hand, since Naver does not know all the Hanja synonyms
$> ./bin/get_pairs_meanings.py data/guess_syn1.txt output [amount]