Download onnx models here:
| Model | Precision | Size | URL | Demo |
| --- | --- | --- | --- | --- |
| LLaMa-7B | fp32 | 26GB | huggingface | demo_llama.py |
| LLaMa-7B | fp16 | 13GB | huggingface or hardware model zoo | demo_llama.py |
| RWKV-4-palm-430M | fp16 | 920MB | huggingface or hardware model zoo | demo_rwkv.py |
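If you prefer scripting the download, `huggingface_hub` works; this is a minimal sketch (not from the original README), and the repo id below is a placeholder for the actual huggingface link in the table.

```python
# Minimal download sketch; the repo id is a placeholder, use the huggingface
# repo linked in the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<user>/llama-7b-onnx-fp16",  # placeholder repo id
    local_dir="llama-7b-onnx-fp16",
)
print("model files downloaded to", local_dir)
```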
News:

- 05/18 release RWKV-4 onnx models, a standalone script, and an LLM structure comparison
- 05/09 trt outputs wrong values until issue 2928 is solved
- 04/19 remove GPTQ zero-point guidance
- 04/18 export mixed-precision quant table from GPTQ-for-LLaMa
- 04/11 add 13GB onnx-fp16 models
- 04/11 add memory pool, support 2GB RAM laptops ⭐
- 04/10 reduce onnx model size to 26GB
- 04/10 support temperature, add topk logits warp
- 04/07 add onnxruntime demo
- 04/05 init project
No torch or transformers required.

Why do this?

graphviz crashed on the LLaMa model: an LLM visualization tool must support nesting or operator folding, rather than rendering one big single file.
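As a rough illustration of the scale involved (not from the original README), the sketch below loads an exported graph's structure without pulling in its weights and counts the nodes; the file name is a placeholder.

```python
# Sketch only: inspect graph size without loading the multi-GB weights.
import onnx

# Placeholder path; point this at one of the exported .onnx files.
model = onnx.load("llama-7b-fp16/decoder.onnx", load_external_data=False)

print("nodes:", len(model.graph.node))
print("initializers:", len(model.graph.initializer))
# A flat renderer such as graphviz has to lay out every node at once,
# which is why nesting / operator folding matters for graphs this large.
```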
Here is the graph to call LLaMa (RWKV is similar):
Try the LLaMa onnxruntime demo. No torch is required, and the precision has been checked:
```bash
$ python3 -m pip install -r requirements.txt
$ python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour"
..

# If you only have 4GB memory, use `--poolsize`
$ python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour" --poolsize 4
..
Bonjour.

# Try more options
$ python3 demo_llama.py --help
```
Use demo_rwkv.py to run RWKV:
```bash
$ python3 demo_rwkv.py ${FP16_ONNX_DIR}
```
To export the RWKV onnx model yourself:

```bash
$ git clone https://github.com/BlinkDL/ChatRWKV --depth=1
$ cp llama.onnx/tools/onnx_RWKV_in_150_lines.py ChatRWKV
$ cd ChatRWKV
$ mkdir models
$ python3 onnx_RWKV_in_150_lines.py
```
Then you would get the onnx files:

```bash
$ ls -lah models
..
```
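Not part of the original steps: a quick way to sanity-check an exported file is to open it with onnxruntime and list its inputs and outputs. The file name below is a placeholder; use whatever appears under `models/`.

```python
# Optional sanity check (not in the original steps); the path is a placeholder.
import onnxruntime as ort

sess = ort.InferenceSession("models/rwkv.onnx", providers=["CPUExecutionProvider"])
for inp in sess.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print("output:", out.name, out.shape, out.type)
```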
STEP1 Convert to HF format
These models are converted from alpaca huggingface weights.
If you are using LLaMa or llama.cpp, convert it to HF format first. Here are the steps:
```bash
# install transformers master
$ git clone https://github.com/huggingface/transformers
$ cd transformers && python3 setup.py install
..

$ python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ${LLaMa_PATH} --model_size 7B --output_dir ${HF_PATH}
```
If you are using alpaca-lora, use this script to merge LoRA weights.
If you are using alpaca, go to STEP2.
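Either way, before exporting you can optionally sanity-check that the HF-format weights load. This check is not part of the original steps, and the path is a placeholder:

```python
# Optional sanity check, not part of the original steps.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

hf_path = "${HF_PATH}"  # placeholder: the directory produced in STEP1
tokenizer = LlamaTokenizer.from_pretrained(hf_path)
model = LlamaForCausalLM.from_pretrained(hf_path, torch_dtype=torch.float16)
print(model.config)
```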
STEP2 torch.onnx.export
Check out transformers to this hacking branch, then run a single inference:
```bash
$ python3 tools/export-onnx.py ${PATH_ALPACA_7B}
```
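For orientation, the core of such an export is a `torch.onnx.export` call. The sketch below is a grossly simplified stand-in, not the repo's `tools/export-onnx.py`: it ignores the KV cache, the hacked branch, and the >2GB external-data handling, and the paths are placeholders.

```python
# Grossly simplified export sketch; not tools/export-onnx.py.
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("${PATH_ALPACA_7B}")  # placeholder path
model.eval()
model.config.use_cache = False      # drop past_key_values for this sketch
model.config.return_dict = False    # the exporter wants tuple outputs

dummy_ids = torch.ones(1, 32, dtype=torch.int64)  # [batch, seq_len]

torch.onnx.export(
    model,
    (dummy_ids,),
    "llama-7b.onnx",                # placeholder output path
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {1: "seq_len"}, "logits": {1: "seq_len"}},
    opset_version=14,
)
```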
STEP3 convert to fp16/tvm
Use onnxconverter-common.float16:

```bash
$ cd tools
$ python3 -m pip install -r requirements.txt
$ python3 convert-fp32-to-fp16.py ${FP32_PATH} ${FP16_PATH}
```
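Roughly speaking, the conversion boils down to onnxconverter-common's float16 converter. The sketch below is a simplified stand-in for `convert-fp32-to-fp16.py` (it ignores the external-data handling a 26GB model needs), with placeholder paths.

```python
# Simplified fp32 -> fp16 sketch; the real logic lives in convert-fp32-to-fp16.py.
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("${FP32_PATH}")  # placeholder path
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)
onnx.save(model_fp16, "${FP16_PATH}")   # placeholder path
```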
Or use relay.vm to convert to tvm:

```bash
$ cd tools
$ python3 convert-to-tvm.py ${ONNX_PATH} ${OUT_DIR}
```
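Conceptually, that tool imports the onnx graph into relay and compiles it with the relay VM, which copes with the dynamic shapes a decoder needs. The sketch below is an assumption-laden outline, not `convert-to-tvm.py`; the input name and shape are placeholders.

```python
# Simplified sketch of the relay.vm path; not the actual convert-to-tvm.py.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("${ONNX_PATH}")  # placeholder path
shape_dict = {"input_ids": (1, 32)}     # placeholder input name / shape

mod, params = relay.frontend.from_onnx(onnx_model, shape=shape_dict)

with tvm.transform.PassContext(opt_level=3):
    vm_exec = relay.vm.compile(mod, target="llvm", params=params)
```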
The outputs of onnxruntime-cpu and torch-cuda have been compared, and the maximum error is 0.002, not bad.

The default state of demo_llama.py is equivalent to these configurations:

```
temperature=0.1
total_tokens=2000
top_p=1.0
top_k=40
repetition_penalty=1.0
```
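To illustrate what those defaults do, here is a small numpy sketch of the usual temperature / top-k / top-p / repetition-penalty logits warping for one decoding step. It mirrors the common warpers, not necessarily the exact code inside demo_llama.py.

```python
# Illustrative only: how the defaults above act on a single step of logits.
import numpy as np

def warp_logits(logits, generated_ids, temperature=0.1, top_k=40,
                top_p=1.0, repetition_penalty=1.0):
    logits = logits.astype(np.float64).copy()

    # repetition penalty: damp tokens that have already been generated
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # temperature: <1.0 sharpens the distribution, >1.0 flattens it
    logits = logits / temperature

    # top-k: keep only the k largest logits
    k = min(top_k, logits.size)
    if k > 0:
        kth = np.sort(logits)[-k]
        logits[logits < kth] = -np.inf

    # top-p (nucleus): keep the smallest prefix whose probability mass covers top_p
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    exclusive_cum = np.cumsum(probs[order]) - probs[order]
    logits[order[exclusive_cum > top_p]] = -np.inf

    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Example: sample the next token id from warped logits.
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=32000)        # stand-in for one step of vocab logits
probs = warp_logits(fake_logits, generated_ids=[1, 2, 3])
next_id = rng.choice(len(probs), p=probs)
```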