💡Document-Instruct generates tailored instructions to rapidly adapt models to new docs.
Data collection and preprocessing can be time-consuming and resource-intensive. Topic2Dataset is an open-source project that automates dataset generation from specified topics and websites, eliminating manual data gathering and shortening dataset creation timelines. The project mainly aims to curate datasets for Retrieval-Augmented Generation (RAG) and LLM fine-tuning.
Demo Notebook:
Clone the repository
git clone https://github.com/adithya-s-k/topic2dataset
cd topic2dataset
Install Required Dependencies
pip install -r requirements.txt
Quick Start dataset generation
python generate.py --topic "Clinical Trials" --website '["https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/", "https://clinregs.niaid.nih.gov/#"]' --time_in_hrs 10 --output local
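If you prefer to launch the quick-start command from Python, building the argument list explicitly avoids shell quoting problems with the multi-word topic and the `#` in the second URL. The snippet below is a minimal sketch, not part of the repository: `build_generate_command` is a hypothetical helper, and it assumes `generate.py` accepts the website list as a single JSON-style string, which this README does not confirm.

```python
import json
import shlex


def build_generate_command(topic, websites, time_in_hrs, output="local"):
    """Return the argument list for the quick-start command (hypothetical helper)."""
    return [
        "python", "generate.py",
        "--topic", topic,
        # Assumption: the script takes the site list as one JSON-style string.
        "--website", json.dumps(websites),
        "--time_in_hrs", str(time_in_hrs),
        "--output", output,
    ]


cmd = build_generate_command(
    "Clinical Trials",
    [
        "https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/",
        "https://clinregs.niaid.nih.gov/#",
    ],
    10,
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

To actually run it from the repository root, pass the list to `subprocess.run(cmd, check=True)`; because the arguments are given as a list, no shell is involved and no extra quoting is needed.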
Topic2Dataset is open source. We welcome contributions and collaboration from the community! See the project page for ways to contribute.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use Topic2Dataset in your research, please cite the following paper:
@article{Topic2Dataset,
title={Topic2Dataset: AI-Driven Dataset Curation for RAG and Finetuning},
author={Adithya S K},
year={2023}
}
The resources associated with this project, including code, data, and model weights, are restricted to academic research purposes and cannot be used commercially. The content produced by any version of Topic2Dataset is influenced by uncontrollable variables such as randomness, so this project cannot guarantee the accuracy of the output. This project accepts no legal liability for the content of the output and assumes no responsibility for any losses incurred through the use of the associated resources and output results.