Hanja Graph Project

Author: Pablo Estrada < pablo (at) snu (dot) ac (dot) kr >

This repo contains the code resulting from the Hanja Graph Project, developed by Pablo Estrada as a side project.

### Folders

- Crawlers - This folder contains the crawlers that download the data.
  At the time of writing, only one crawler is implemented.
- Formatters - This folder contains the small Python scripts that take
  the files created by the crawlers and output a file in an accepted graph format.
- Analysis - This folder contains the scripts that analyze the graph.
- Test_data - This folder contains sample data, for anyone who just wants
  the data that results from all the processing:
  - graph.graphml - The full graph, with links between hanja
    and Korean words. No bipartite distinction is made between the two node types.
  - hanja_list.json - The list of hanja, as returned by the
    crawler.
  - words.nospace.json - The list of Korean words, as
    returned by the crawler.
  - korean_unip_projection.graphml - The projection of
    the Korean words from the bipartite graph. In the current version, the edge
    weights are 1 or 2, depending on how many Chinese characters are shared
    between two words.
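
The projection described for korean_unip_projection.graphml can be sketched in plain Python. This is only an illustration, not the code that produced the file: the function name `project_words`, the input layout (a dict from each word to its list of hanja), and the toy data are all assumptions. The weight of an edge is the number of distinct hanja two words share.

```python
from collections import defaultdict
from itertools import combinations


def project_words(word_to_hanja):
    """Project a word-hanja bipartite graph onto its word nodes.

    Returns a dict mapping a sorted (word_a, word_b) pair to the number
    of distinct hanja the two words share. (Hypothetical helper.)
    """
    # Invert the mapping: for each hanja, collect the words that contain it.
    hanja_to_words = defaultdict(set)
    for word, hanja_chars in word_to_hanja.items():
        for hanja in set(hanja_chars):
            hanja_to_words[hanja].add(word)

    # Every pair of words listed under the same hanja shares that character.
    weights = defaultdict(int)
    for words in hanja_to_words.values():
        for a, b in combinations(sorted(words), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Toy data, made up for illustration only.
sample = {
    "학교": ["學", "校"],
    "학생": ["學", "生"],
    "교장": ["校", "長"],
    "학교장": ["學", "校", "長"],
}
edges = project_words(sample)
```

Words that share no hanja get no edge at all, so the result only lists pairs with a weight of at least 1.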

Scrapers/Crawlers

Scrape Kanjis

These utilities scrape the kanji information at 'http://www.manythings.org/kanji/d/'. Each one serves a different purpose.

- scrape_kanji.py - This is the main scraper. It gets the data and outputs a JSON file with
  words and kanji. This JSON file can be used to generate the GraphML file.
- make_kanji_graph.py - This takes the JSON output from scrape_kanji.py and turns it into a GraphML
  file.
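
As a rough idea of the JSON-to-GraphML step, here is a minimal sketch using only the standard library. It is not the code in make_kanji_graph.py. The JSON layout produced by the scraper is not documented here, so the shape `{word: [kanji, ...]}` is an assumption, and `json_to_graphml` is a hypothetical name.

```python
import json
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def json_to_graphml(json_text):
    """Build a GraphML document linking each word to its kanji."""
    # Assumed input shape: {word: [kanji, ...]}
    word_to_kanji = json.loads(json_text)

    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    graph = ET.SubElement(root, "graph", id="G", edgedefault="undirected")

    seen = set()

    def add_node(node_id):
        # GraphML node ids must be unique, so add each one only once.
        if node_id not in seen:
            seen.add(node_id)
            ET.SubElement(graph, "node", id=node_id)

    for word, kanji_list in word_to_kanji.items():
        add_node(word)
        for kanji in kanji_list:
            add_node(kanji)
            ET.SubElement(graph, "edge", source=word, target=kanji)

    return ET.tostring(root, encoding="unicode")


# Toy input, made up for illustration only.
sample_json = json.dumps({"学校": ["学", "校"], "学生": ["学", "生"]})
graphml_text = json_to_graphml(sample_json)
```

Because words and kanji share one id space in this sketch, a one-character word and the kanji it is written with would end up as the same node.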

Scrape naver

Not yet available : )

Obtaining synonyms

Obtaining the synonyms training set

To generate the synonyms training set we need to follow these steps:

(1) Use the graph dataset to obtain the features of each node pair

$> nohup ./bin/generate_csv_p.py data/hanja_unip.graphml res.csv 4

(2) Obtain the 'zeros' (the non-related pairs) in the training set by randomly sampling rows from the main CSV file

$> shuf -n 1000 data/res.csv > data/training_zeros

$> ./bin/removeFirstColumns data/training_zeros data/training_non_related.csv
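
For readers without `shuf`, step (2) can be approximated in Python. This is a sketch under assumptions: the names `sample_rows` and `drop_first_columns` are hypothetical, and how many leading columns removeFirstColumns actually strips is not stated here, so `n_drop` is left as a parameter.

```python
import csv
import io
import random


def sample_rows(rows, n, seed=None):
    """Return up to n rows chosen uniformly at random, without replacement."""
    rng = random.Random(seed)
    rows = list(rows)
    # Like `shuf -n`, return everything when fewer than n rows exist.
    return rng.sample(rows, min(n, len(rows)))


def drop_first_columns(rows, n_drop):
    """Strip the leading n_drop columns from every row."""
    return [row[n_drop:] for row in rows]


# Made-up CSV content: two identifier columns followed by two features.
csv_text = "a,b,0.1,3\nc,d,0.4,1\ne,f,0.9,2\ng,h,0.2,5\n"
all_rows = list(csv.reader(io.StringIO(csv_text)))
zeros = drop_first_columns(sample_rows(all_rows, 2, seed=0), n_drop=2)
```

Passing a fixed `seed` makes the sample reproducible, which `shuf` does not do by default.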

(3) Obtain the 'synonyms' in the training set

Obtain a random set of hanja from the res.csv file

$> shuf -n 1000 data/res.csv | awk -F "," '{print $3}' > data/tmp

$> cat data/tmp | sort | uniq > data/random_hanjas.txt

Obtain synonyms and antonyms for these hanja

$> ./bin/scrapeSynonyms data/random_hanjas.txt data/antonyms_hanja.txt data/synonyms_hanja.txt

Get the features from these pairs of synonyms or antonyms

$> ./bin/extractPairsFromCsv data/synonyms_hanja.txt data/res.csv data/synonyms.csv
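
The lookup in this last command amounts to filtering the feature CSV down to the rows for known pairs. The sketch below is not the extractPairsFromCsv implementation. It assumes that two columns of each row identify the pair (the `key_cols` default of the first two columns is a guess) and that a pair matches regardless of order.

```python
import csv
import io


def extract_pair_rows(pairs, csv_rows, key_cols=(0, 1)):
    """Keep the CSV rows whose two key columns form one of the given pairs.

    Pairs are treated as unordered, so (a, b) also matches a row
    that lists (b, a). (Hypothetical helper.)
    """
    wanted = {frozenset(pair) for pair in pairs}
    first, second = key_cols
    return [row for row in csv_rows
            if frozenset((row[first], row[second])) in wanted]


# Made-up data: one synonym pair and three feature rows.
synonym_pairs = [("大", "巨")]
csv_text = "大,小,0.2,1\n巨,大,0.8,3\n小,少,0.5,2\n"
feature_rows = list(csv.reader(io.StringIO(csv_text)))
matched = extract_pair_rows(synonym_pairs, feature_rows)
```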

(4) Use the result to run a classification scheme ; )
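
Before any classifier can run, the two sets have to be merged into one labeled training set. A minimal sketch, assuming every remaining column is a numeric feature and using label 1 for synonym pairs and 0 for the 'zeros'. The function name `build_training_set` is hypothetical, and no particular classifier is implied.

```python
def build_training_set(synonym_rows, unrelated_rows):
    """Return (features, labels) with 1 for synonym pairs and 0 otherwise."""
    features, labels = [], []
    for label, rows in ((1, synonym_rows), (0, unrelated_rows)):
        for row in rows:
            # Assumes every column in the row is a numeric feature.
            features.append([float(value) for value in row])
            labels.append(label)
    return features, labels


# Made-up feature rows, as they would look after the identifier
# columns have been removed.
features, labels = build_training_set(
    [["0.8", "3"]],
    [["0.2", "1"], ["0.5", "2"]],
)
```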

Running the classification script

(1) Run the classification script

$> ./bin/get_synonyms.py data/res.csv data/synonyms_training.csv data/training_non_related.csv data/guess_syn1.txt

(2) Verify the results

$> ./bin/checkSynonyms data/guess_syn1.txt

(3) Verify the data by hand, since Naver does not know all the hanja synonyms

$> ./bin/get_pairs_meanings.py data/guess_syn1.txt output [amount]
