CCA GPT Model Trainer

This repository contains a script written for my employer's environment, which uses MediaWiki and Jira Cloud. The tool fetches, cleans, and combines data from MediaWiki and Jira, then fine-tunes a GPT language model on the combined dataset. All settings are currently hardcoded in the Python script; moving them to a configuration file is planned. The script is written to cope with limited GPU memory and can fall back to the CPU if needed.

Requirements

Python 3.x
Required Python packages (install using pip install -r requirements.txt):
- art
- colorama
- mysqlclient (provides the MySQLdb module)
- beautifulsoup4
- requests
- transformers
- torch
- datasets
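
Two of the packages above are imported under a different name than the one passed to pip: beautifulsoup4 is imported as bs4, and the MySQLdb module ships in the mysqlclient distribution. The sketch below is one way to check that everything is importable before starting a long run. The helper name find_missing_packages and the mapping dictionary are hypothetical and are not part of the script.

```python
import importlib.util

# pip package name -> module name used in `import` statements.
# This helper is an illustrative assumption, not code from the script.
PACKAGE_TO_MODULE = {
    "art": "art",
    "colorama": "colorama",
    "mysqlclient": "MySQLdb",
    "beautifulsoup4": "bs4",
    "requests": "requests",
    "transformers": "transformers",
    "torch": "torch",
    "datasets": "datasets",
}


def find_missing_packages(package_to_module=PACKAGE_TO_MODULE):
    """Return the pip package names whose module cannot be located."""
    missing = []
    for package, module in package_to_module.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    missing = find_missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```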

Usage

Clone the repository:

```
git clone https://github.com/yourusername/cca-gpt-model-trainer.git
cd cca-gpt-model-trainer
```

Install dependencies:
```
pip install -r requirements.txt
```

Update settings in the script:

Open the cca-gpt-model-trainer.py script and update the following settings:

MySQL Database Connection:

```
connection = MySQLdb.connect(
    host="localhost", user="grahf", password="<password>", database="local_wiki"
)
```

Change host, user, password, and database to match your MediaWiki database credentials.

Jira API Connection:

```
url = "https://site.atlassian.net/rest/api/3/search"  # CHANGE THIS TO APPROPRIATE JIRA URL
params = {
    "jql": "project = CSS",  # CHANGE THIS TO APPROPRIATE PROJECT CODE
    "maxResults": 3000,
    "fields": "summary,description,comment",
}

email = "<email>"  # CHANGE THIS TO APPROPRIATE JIRA USER
api_token = "<token>"  # ADD JIRA TOKEN
```

Change the url, params['jql'], email, and api_token to match your Jira Cloud instance and credentials.
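
Jira Cloud may cap the number of issues returned per request below the requested maxResults, so a single call asking for 3000 can come back with fewer. The sketch below shows one way to page through the results. It assumes the offset-based search response carries issues and total fields and accepts a startAt parameter; token-based search endpoints page differently. The function names fetch_all_issues and make_jira_fetcher are hypothetical and do not come from the script.

```python
def fetch_all_issues(fetch_page, page_size=100):
    """Collect issues page by page until the reported total is reached.

    fetch_page(start_at, max_results) must return a dict shaped like the
    search response: {"issues": [...], "total": <int>}.
    """
    issues = []
    start_at = 0
    while True:
        page = fetch_page(start_at, page_size)
        batch = page.get("issues", [])
        issues.extend(batch)
        start_at += len(batch)
        # Stop on an empty page, or once every reported issue is collected.
        # A response without "total" ends the loop after the first page.
        if not batch or start_at >= page.get("total", 0):
            break
    return issues


def make_jira_fetcher(url, jql, email, api_token,
                      fields="summary,description,comment"):
    """Build a fetch_page callable backed by the Jira search endpoint."""
    import requests  # imported lazily; the paging logic itself needs no network

    def fetch_page(start_at, max_results):
        response = requests.get(
            url,
            params={
                "jql": jql,
                "startAt": start_at,
                "maxResults": max_results,
                "fields": fields,
            },
            auth=(email, api_token),  # HTTP basic auth: email plus API token
            headers={"Accept": "application/json"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

    return fetch_page
```

With real credentials this would be called as fetch_all_issues(make_jira_fetcher(url, "project = CSS", email, api_token)). Keeping the paging loop separate from the HTTP call makes it possible to exercise the loop with a stub instead of a live Jira instance.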

Run the script:
```
python cca-gpt-model-trainer.py
```

Script Functions

- fetch_mediawiki_data(): Fetches MediaWiki data and saves it to a text file. Update the database connection settings first.
- clean_mediawiki_data(): Cleans the MediaWiki data by removing HTML tags and other unnecessary content.
- fetch_jira_data(): Fetches Jira entries using the Jira REST API and saves them to a JSON file. Update the Jira connection settings first.
- combine_files(jira_file, mediawiki_file, combined_file): Combines Jira and MediaWiki data into a single text file for training.
- download_tokenizer_files(model_name, output_dir): Downloads the tokenizer files for the specified model.
- fine_tune_gpt_model(data_file, output_dir): Fine-tunes a GPT model on the combined data.
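
The fine-tuning step is where GPU memory limits usually show up. The sketch below illustrates one possible shape for a GPU-to-CPU fallback; it is not necessarily how the script does it. The names run_with_cpu_fallback and train_fn are hypothetical. It relies on PyTorch reporting CUDA memory exhaustion as a RuntimeError whose message contains "out of memory".

```python
def run_with_cpu_fallback(train_fn, prefer_gpu=True):
    """Run train_fn(device) on the GPU first, retrying on the CPU after an
    out-of-memory error.

    train_fn is a hypothetical callable that accepts "cuda" or "cpu".
    Returns a (result, device_used) tuple.
    """
    if prefer_gpu:
        try:
            return train_fn("cuda"), "cuda"
        except RuntimeError as error:
            # Only memory exhaustion triggers the fallback; any other
            # RuntimeError is a real bug and is re-raised unchanged.
            if "out of memory" not in str(error).lower():
                raise
    # Reached when the GPU is not preferred or the GPU attempt ran out of memory.
    return train_fn("cpu"), "cpu"
```

In a real training loop it would also be worth releasing cached GPU memory before the CPU retry; that step is left out here so the sketch stays free of dependencies.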

Future Improvements

- Externalize settings to a configuration file to avoid hardcoding values in the script.
- Add logging for better traceability and debugging.
- Improve error handling and retry mechanisms.
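
As a rough idea of what externalized settings could look like, the sketch below reads the MySQL and Jira values from an INI file and lets environment variables override the two secrets. Everything here is an assumption for illustration: the file name settings.ini, the section and key names, the variable names CCA_MYSQL_PASSWORD and CCA_JIRA_TOKEN, and the load_settings helper are not part of the script.

```python
import configparser
import os


def load_settings(path="settings.ini", environ=os.environ):
    """Read connection settings from an INI file.

    Secrets may be supplied through environment variables instead, so
    they never need to be written to disk.
    """
    # interpolation=None keeps literal "%" characters in tokens intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is skipped, leaving only defaults

    mysql = parser["mysql"] if parser.has_section("mysql") else {}
    jira = parser["jira"] if parser.has_section("jira") else {}

    return {
        "mysql_host": mysql.get("host", "localhost"),
        "mysql_user": mysql.get("user", ""),
        "mysql_database": mysql.get("database", ""),
        "mysql_password": environ.get(
            "CCA_MYSQL_PASSWORD", mysql.get("password", "")
        ),
        "jira_url": jira.get("url", ""),
        "jira_jql": jira.get("jql", ""),
        "jira_email": jira.get("email", ""),
        "jira_api_token": environ.get(
            "CCA_JIRA_TOKEN", jira.get("api_token", "")
        ),
    }
```

A matching settings.ini would contain a [mysql] section with host, user, and database keys, and a [jira] section with url, jql, and email keys.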

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Dean Thomson (GitHub: grahfmusic)
