KorQuAD-beginner

A guide to uploading a KorQuAD result (EM 68.947 / F1 88.468) to the leaderboard, using only the BERT-multilingual (single) model.


This is a repository for those trying KorQuAD for the first time. I got F1 88.468 and EM (Exact Match) 68.947 using only the BERT-multilingual (single) model and google-research's BERT GitHub code.

While it was not difficult to get results simply by fine-tuning on the KorQuAD data, using CodaLab for the first time was quite hard. So this repository also explains in detail how to use CodaLab and how to upload your result to the KorQuAD leaderboard.

1. Fine-tuning on KorQuAD

It is very easy to fine-tune if you use google-research's BERT GitHub code; everything is possible with the single command below. Note that if you use the original google-research BERT code as-is, Korean subtokens are processed as UNK because of unicodedata.normalize("NFD", text). Please see here and the pull request code if you want more detail (a small demonstration of the issue follows the command below). I have already applied the fix for this Korean issue in my code. I used BERT-Base, Multilingual Cased (New, recommended) for the pre-trained weights (104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters).

$ python run_squad.py \
  --vocab_file=multi_cased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=multi_cased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=multi_cased_L-12_H-768_A-12/bert_model.ckpt \
  --do_train=True \
  --train_file=config/KorQuAD_v1.0_train.json \
  --do_predict=True \
  --predict_file=config/KorQuAD_v1.0_dev.json \
  --train_batch_size=4 \
  --num_train_epochs=3.0 \
  --max_seq_length=384 \
  --output_dir=./output \
  --do_lower_case=False
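As a quick illustration of the Korean issue mentioned above, here is a minimal sketch (plain Python 3 using only the standard library; not part of this repo or of BERT) showing how NFD normalization decomposes precomposed Hangul syllables into conjoining Jamo, which then no longer match the precomposed entries in the multilingual vocabulary:

# minimal demonstration of the NFD issue (Python 3, standard library only)
import unicodedata

word = "한국어"
nfd = unicodedata.normalize("NFD", word)

print(len(word))    # 3 precomposed Hangul syllables
print(len(nfd))     # 8 conjoining Jamo after decomposition
print(word == nfd)  # False: vocabulary lookups see different code points

This mismatch is why the unpatched tokenizer falls back to UNK for Korean text.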

I trained on a GeForce RTX 2080 with 11GB of memory, and it took about three to four hours.

Then you can evaluate the F1 and EM scores:

$ python evaluate-v1.0.py config/KorQuAD_v1.0_dev.json output/predictions.json

{"f1": 88.57661204608746, "exact_match": 69.05091790786284}
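For reference, here is a simplified sketch of how SQuAD-style EM and F1 are usually computed per question (each question's score is the maximum over its gold answers, then averaged over all questions). The actual evaluate-v1.0.py applies its own answer normalization for Korean, so treat this only as an illustration:

# simplified SQuAD-style metrics; evaluate-v1.0.py's normalization differs
from collections import Counter

def exact_match(prediction, gold):
    # 1.0 if the strings match exactly (after trimming), else 0.0
    return float(prediction.strip() == gold.strip())

def f1_score(prediction, gold):
    # token-level overlap between prediction and gold answer
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# per question: take the max over all gold answers, then average over the dataset
# em = max(exact_match(pred, g) for g in golds)
# f1 = max(f1_score(pred, g) for g in golds)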

Put your fine-tuned checkpoint files in the config folder; they will be uploaded in the CodaLab steps below.

2. CodaLab Guide for Submitting to KorQuAD

0. Sign up for CodaLab

Sign up on CodaLab Worksheets, not CodaLab Competitions.

1. Install the CodaLab CLI

The CLI only works with Python 2.7 right now, so if Python 3 is your default, I recommend setting up an Anaconda environment rather than virtualenv.

$ conda create -n py27 python=2.7 anaconda
$ conda activate py27
(py27) $ pip install codalab -U --user

2. Create a New Worksheet in CodaLab

Click the New Worksheet button in the upper right corner and name the worksheet.

3. Test the predictions.json created during fine-tuning

If you don't want to see the evaluation on the dev set, just skip this part.

  1. Click the Upload button and upload the output/predictions.json created during fine-tuning.

  2. In the web interface terminal at the top of the page, type the following command:

    Here <name-of-your-uploaded-prediction-bundle> is uuid[0:8] (the first eight characters of the UUID) of the uploaded output/predictions.json bundle.

    # web interface terminal
    cl macro korquad-utils/dev-evaluate-v1.0 <name-of-your-uploaded-prediction-bundle>
    
4. Upload the src and config folders

The reason for splitting the folders is that uploading the config folder (checkpoint files) every time takes a long time.

  • src folder : only the source files needed to run on CodaLab, e.g., run_squad.py
  • config folder : bert_config.json, KorQuAD_v1.0_dev.json, vocab.txt, and the fine-tuned checkpoint files

Upload the folders using the CodaLab CLI. You can also upload via the Web UI, but it is slow for folders, so use the CLI.

# command on Anaconda-Python2.7
# cl upload <folder-name> -n <bundle-name>
$ cl upload Leaderboard -n src
$ cl upload config -n config

5. Run the model on the dev set
# web interface terminal
cl add bundle korquad-data//KorQuAD_v1.0_dev.json .

You may see the message bundle spec . doesn't match any bundles, but it is not a problem; you can proceed.

Then, run the model on the dev set!

Here :src and :config follow the form :<bundle-name>, which attaches the named bundle as a dependency of the run.

# command on Anaconda-Python2.7
$ cl run :KorQuAD_v1.0_dev.json :src :config "python src/run_KorQuAD.py --bert_config_file=config/bert_config.json --vocab_file=config/vocab.txt --init_checkpoint=config/model.ckpt-45000 --do_predict=True --output_dir=output config/KorQuAD_v1.0_dev.json predictions.json" -n run-predictions --request-docker-image tensorflow/tensorflow:1.12.0-gpu-py3 --request-memory 11g --request-gpus 1
Troubleshooting
  1. About arguments: the original KorQuAD tutorial recommends passing the Python arguments in this form.

    python src/<path-to-prediction-program> <input-data-json-file> <output-prediction-json-path>
    

    The above pattern maps to the command below, where each path has the form <bundle-name>/file-name:

    python src/run_KorQuAD.py --bert_config_file=config/bert_config.json --vocab_file=config/vocab.txt --init_checkpoint=config/model.ckpt-45000 --do_predict=True --output_dir=output config/KorQuAD_v1.0_dev.json predictions.json
    
  2. About the ImportError: libcuda.so.1 error: use the TensorFlow 1.12.0 Docker image with Python 3 and add --request-gpus 1 to explicitly request a GPU.

  3. About out-of-memory errors: add --request-memory 11g.

6. Add the prediction file (result of part 5) to a bundle

Make a bundle from the prediction file. Note that MODELNAME cannot contain spaces or special characters.

# web interface terminal
cl make run-predictions/predictions.json -n predictions-{MODELNAME}

Let's check that the dev-set prediction file can be evaluated.

# web interface terminal
cl macro korquad-utils/dev-evaluate-v1.0 predictions-{MODELNAME}

7. Submit

After this, to submit your result to the leaderboard, see part 3 of the original KorQuAD tutorial.

License

This repository is released under the same license as the google-research BERT source code (Apache 2.0).

Author

Tae Hwan Jung

Related Projects