🚀 This project aims to develop an app built on an existing open-source LLM, fine-tuned locally on data collected for domain-specific Jenkins knowledge, and set up with a proper UI for the user to interact with.
You may need to update the environment variables set in BE/.env and FE/.env.
In the .env file in the FE/ directory, you will find the server URL, which points to localhost by default:
VITE_SERVER_URL = http://127.0.0.1:5000/
In the .env file in the BE/ directory, you will also find both HOST and PORT, configured by default as:
FLASK_RUN_HOST = 0.0.0.0
FLASK_RUN_PORT = 5000
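For reference, here is a minimal sketch (not the actual BE/app.py) of how a Flask entry point could pick up these values from BE/.env, assuming python-dotenv is installed:

```python
# Illustrative sketch only, not the project's BE/app.py.
import os

from dotenv import load_dotenv
from flask import Flask

load_dotenv()  # reads the .env file in the current (BE/) directory

app = Flask(__name__)

if __name__ == "__main__":
    app.run(
        host=os.getenv("FLASK_RUN_HOST", "127.0.0.1"),
        port=int(os.getenv("FLASK_RUN_PORT", "5000")),
    )
```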
Open a new terminal in the project directory and start the frontend:
cd ./FE
npm install
npm run dev
In another terminal, set up and run the backend:
cd ./BE
python3 -m venv .
source ./bin/activate
pip install -r ./requirements.txt
python app.py
You can fine-tune your own version of the model and upload it to Hugging Face using the following steps.
We fine-tune Llama 2 using Colab's free T4 GPU (16 GB VRAM).
We provide ./src/Fine-Tuning.ipynb for this purpose.
git clone https://github.com/nouralmulhem/Enhancing-LLM-with-Jenkins-Knowledge.git
Google Drive is used to store the checkpoints to ensure their persistence in case the Colab environment crashes.
You can change the Drive path where the model is saved by editing the new_model_path variable.
You can also set the number of epochs used to fine-tune the model by updating the num_train_epochs variable; both are illustrated in the sketch below.
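As a rough illustration of those two variables (the exact training setup lives in the notebook; the Drive path and the other arguments below are placeholders):

```python
# Illustration of the two variables mentioned above; not a copy of the notebook.
from transformers import TrainingArguments

new_model_path = "/content/drive/MyDrive/jenkins-llama2"  # Drive path for checkpoints (placeholder)
num_train_epochs = 1                                      # increase to fine-tune for longer

training_args = TrainingArguments(
    output_dir=new_model_path,          # checkpoints on Drive survive a Colab crash
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=4,
    fp16=True,                          # mixed precision to fit the T4's 16 GB VRAM
    save_strategy="epoch",
)
```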
Once fine-tuning is done, open ./src/Upload_Model.ipynb to merge the LoRA weights into the base model, upload your own model to Hugging Face, and start using it.
At this stage you need to update the new_model_path variable to the correct path on your Drive.
As a final step, update the repo_id variable to match your repo on Hugging Face.
VOILA! You have your own model.
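In outline, the merge-and-upload step looks roughly like the following (the base model name, paths, and repo below are placeholders; the notebook has the exact code):

```python
# Rough outline of the merge-and-upload step; all names below are placeholders.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "meta-llama/Llama-2-7b-chat-hf"          # assumed base model
new_model_path = "/content/drive/MyDrive/jenkins-llama2"   # fine-tuned adapter on Drive
repo_id = "username/jenkins-llama2"                        # your Hugging Face repo

base = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(base, new_model_path)
model = model.merge_and_unload()  # fold the LoRA weights into the base weights

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```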
You can load this full model onto the GPU and run it like any other Hugging Face model (a short example follows), but we are here to take it to the next level by running the model on the CPU.
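For reference, loading it on a GPU is just the standard Hugging Face flow (the repo name is a placeholder, and device_map="auto" requires accelerate to be installed):

```python
# Standard Hugging Face loading on GPU; "username/jenkins-llama2" is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

repo_id = "username/jenkins-llama2"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.float16, device_map="auto"
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("How do I trigger a Jenkins build from a webhook?", max_new_tokens=128)[0]["generated_text"])
```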
We are using llama.cpp for this, so first of all we need to clone the repo:
git clone https://github.com/ggerganov/llama.cpp.git
llama.cpp has a script called convert_hf_to_gguf.py that converts Hugging Face models to the binary GGUF format, which can then be loaded and run on the CPU:
python convert_hf_to_gguf.py path/to/fine-tuned/model/ --outtype f16 --outfile path/to/binary/model.bin
This should output a 13 GB binary file at the specified path/to/binary/model.bin that is ready to run on the CPU with the same code that we started with, for example as sketched below.
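One way (among others) to run the converted file on the CPU is through the llama-cpp-python bindings (pip install llama-cpp-python; the path is the --outfile from the previous step):

```python
# Example of running the converted GGUF binary on CPU via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="path/to/binary/model.bin", n_ctx=2048)
result = llm("How do I create a Jenkins pipeline job?", max_tokens=256)
print(result["choices"][0]["text"])
```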
Part of the appeal of the GGML library is being able to quantize this 13 GB model into smaller models that run even faster. The llama.cpp repo includes a llama-quantize tool that can convert the model to different quantization levels.
First you need to build the tools in the Llama.cpp repository.
cd llama.cpp
cmake -B build
cmake --build build --config Release
This will create the tools in the bin directory. You can now use the llama-quantize tool to shrink the model to q8_0 by running:
cd build/bin/Release
./llama-quantize.exe path/to/binary/model.bin path/to/binary/merged-q8_0.bin q8_0
(On Linux/macOS the binary is built as build/bin/llama-quantize, without the .exe suffix.)
Now we have a 6.7 GB model at path/to/binary/merged-q8_0.bin
To upload the local quantized model to Hugging Face, run:
huggingface-cli upload username/repo_id path/to/binary/quantized/model.bin model.bin
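The same upload can also be done from Python with huggingface_hub (same placeholder names as above):

```python
# Python equivalent of the CLI upload above; repo and paths are placeholders.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="path/to/binary/merged-q8_0.bin",
    path_in_repo="model.bin",
    repo_id="username/repo_id",
)
```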
Note: This software is licensed under the MIT License. See LICENSE for more information. © nouralmulhem