xvii_admin_bot

ml bot that answers users' most common questions

russian answering bot for vk groups (and not only)

this is an ML solution for creating a bot for groups of the vk social network (but you can easily create a delegate for your own social network). the bot's behavior is based on real users' messages sent earlier. currently the bot supports only the russian language
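as a rough illustration of what such a delegate could look like, here is a minimal sketch; the class and method names are hypothetical and not taken from the project's code:

# hypothetical delegate interface, not the project's actual class names
class SocialDelegate:

    def fetch_dialogs(self, count):
        """return a list of dialogs, each a list of raw user messages."""
        raise NotImplementedError

    def send_response(self, peer_id, text):
        """send an answer back to the user."""
        raise NotImplementedError

    def mark_as_read(self, peer_id):
        """mark the conversation as read when no answer is needed."""
        raise NotImplementedError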

this bot was created to assist me in answering the most frequent questions in a vk group's messages (see xvii messenger for vk)

read further for more information about how it works

this project is in a baseline state

installation

step 0. cloning and setup

clone this repository using git

git clone https://github.com/TwoEightNine/xvii_admin_bot.git
cd xvii_admin_bot

then install and activate a virtual environment

sudo apt install python3-venv
python3.6 -m venv admin_bot_env
source admin_bot_env/bin/activate
pip install -r requirements.txt

in the root directory you should create a file secret.py with some sensitive information. the file should look like this:

access_token = 'your token here'
no_fetch_users = [13371337228]

access_token is a token to access group messages (to obtain the token visit this page, do not forget to use scope=messages,offline). no_fetch_users is a list of users' ids: if you want to ignore messages from a user, put their id here
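purely for illustration, ignoring messages from no_fetch_users could look roughly like this inside the fetcher; the message structure here is an assumption, not the project's actual data format:

from secret import no_fetch_users

# hypothetical filtering step; the field names are an assumption
raw_messages = [
    {'from_id': 13371337228, 'text': 'spam'},
    {'from_id': 42, 'text': 'how do i enable dark theme?'},
]
messages = [m for m in raw_messages if m['from_id'] not in no_fetch_users]
print(messages)  # only the message from user 42 remains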

step 1. fetching messages

to fetch messages run

python3 fetcher.py --count COUNT --social SOCIAL [-h]

where COUNT is how many dialogs to fetch messages from and SOCIAL is which social network to use. pass -h to see which social networks are supported

the script will load messages into data/messages.csv
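you can quickly inspect the fetched data with pandas, for example (the exact columns of data/messages.csv are not described here, so check the actual header):

import pandas as pd

df = pd.read_csv('data/messages.csv')
print(df.shape)   # how many messages were fetched
print(df.head())  # peek at the first rows and the actual column names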

step 2. find clusters

this step performs semi-automatic labelling: fetched messages are lemmatized and cleaned, then converted to tf-idf vectors, and spectral clustering is applied (a simplified sketch of this pipeline is shown at the end of this step). to perform clustering run:

python3 clusterizer.py [--search] [--clusters_count CL_COUNT] --random_state RND_ST [-h]

where --search is an optional flag to search for a better clusters count, --clusters_count is required to perform the final clustering and defines the preferred number of clusters, and --random_state is a random int for reproducibility

you may want to run a search (with the --search flag) to calculate clustering metrics for different numbers of clusters. in this case the script will print these metrics

after the search you will have chosen a 'good' clusters count for your task. now run the script again with --clusters_count YOUR_VALUE and it will create data/model_explanation.txt with information about the most frequent words in every cluster. if you think the clustering result is not good enough, rerun clustering with another number of clusters or another random state
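for reference, the core of this step roughly corresponds to a tf-idf + spectral clustering pipeline like the one below; this is a simplified sketch on a toy english corpus, not the actual clusterizer.py (the real script lemmatizes and cleans russian text first):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralClustering

# toy corpus standing in for the cleaned, lemmatized messages
texts = [
    'how to enable dark theme',
    'dark theme does not work',
    'how to buy stickers',
    'stickers are not shown',
    'app crashes on start',
    'app crashes after update',
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)

clustering = SpectralClustering(n_clusters=3, random_state=42, affinity='rbf')
labels = clustering.fit_predict(vectors.toarray())
print(labels)  # cluster index per message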

using this data you will create classes.json in the following format:

{
  "your_class_1": {
    "clusters": [3, 7, 11],
    "response": "your_response_for_class_1"
  },
  "your_class_2": {
    "clusters": [2],
    "response": "__UNREAD"
  },
  "your_class_3": {
    "clusters": [16],
    "response": "your_response_for_class_3"
  }
}

this maps clusters to the classes you need; the model will be trained on these classes.

clusters are the indexes of the clusters that match your class. response is the answer sent to the user; this field may contain the special markers __UNREAD and __READ, in which case no response is sent but the conversation is left read (no answer needed) or unread (human attention needed)

all clusters not mentioned implicitly belong to the class undefined with the response __UNREAD
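the cluster-to-class conversion can be pictured like this; a minimal sketch, the variable and function names are not the project's:

import json

with open('classes.json') as f:
    classes = json.load(f)

# invert classes.json into a cluster -> class lookup table
cluster_to_class = {}
for class_name, info in classes.items():
    for cluster in info['clusters']:
        cluster_to_class[cluster] = class_name

# clusters not mentioned in classes.json fall back to 'undefined' (__UNREAD)
def class_of(cluster_index):
    return cluster_to_class.get(cluster_index, 'undefined')

print(class_of(3))   # e.g. 'your_class_1'
print(class_of(99))  # 'undefined'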

step 3. find and train a model

after you have created classes.json you can search for and train a model to perform predictions.

to search execute

python3 modeller.py --search [--cv CV] [--sort_by METRIC]

where --search is a flag indicating that you want to perform a search (using sklearn's GridSearchCV), CV is how many folds to use in cross-validation, and METRIC is a metric alias to sort by

you can use the default search params (estimators and their parameters) or define your own in hyperparams.py (the search_estimators variable)

after the search you will see the 5 best results (according to --sort_by) and can explore all configurations in data/search_results.csv. the best model should then be set in hyperparams.py as final_estimator
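the exact format of search_estimators is defined by the project, but conceptually it pairs candidate estimators with parameter grids for the grid search; a hypothetical example, not the real hyperparams.py:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# hypothetical format: list of (estimator, param_grid) pairs for the search
search_estimators = [
    (LogisticRegression(max_iter=1000), {'C': [0.1, 1.0, 10.0]}),
    (SVC(), {'C': [1.0, 10.0], 'kernel': ['linear', 'rbf']}),
]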

to train a model run

python3 modeller.py [--cv CV] [--pca_n_components N_COM]

where N_COM is the value passed to PCA's n_components; if not set, PCA is not used

data/model_pipeline.pkl and data/model_classes.pkl will be created
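conceptually, training fits a pipeline (optionally with PCA) and pickles it; a simplified sketch on toy data with assumed estimators, not the actual modeller.py:

import pickle
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ['dark theme is broken', 'how to buy stickers', 'app crashes on start']
labels = ['theme', 'stickers', 'crash']  # toy training data

def to_dense(x):
    return x.toarray()  # PCA needs dense input, so densify the tf-idf vectors

steps = [('tfidf', TfidfVectorizer()),
         ('densify', FunctionTransformer(to_dense))]
n_components = 2  # stands in for --pca_n_components
if n_components:
    steps.append(('pca', PCA(n_components=n_components)))
steps.append(('clf', LogisticRegression(max_iter=1000)))

pipeline = Pipeline(steps).fit(texts, labels)

with open('data/model_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)
with open('data/model_classes.pkl', 'wb') as f:
    pickle.dump(list(pipeline.named_steps['clf'].classes_), f)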

optionally, you can interactively check the model using

python3 predictor.py

enter a russian message and see which class the model thinks it belongs to
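a rough sketch of what such an interactive check does, not the actual predictor.py:

import pickle

with open('data/model_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)

while True:
    text = input('> ')
    if not text:
        break
    print(pipeline.predict([text])[0])  # predicted class name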

step 4. run and chill

the bot is ready to start. to launch it enter

python3 bot.py --social SOCIAL

in stdout you will see status messages, incoming messages and predicted answers
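in outline, the bot couples the saved model with a social delegate (see the hypothetical interface sketched above): it predicts a class for an incoming message and either sends the configured response or handles the __READ / __UNREAD markers. a schematic sketch that omits all vk api details:

import json
import pickle

with open('data/model_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)
with open('classes.json') as f:
    classes = json.load(f)

def handle(delegate, peer_id, text):
    predicted = pipeline.predict([text])[0]
    response = classes.get(predicted, {}).get('response', '__UNREAD')
    if response == '__UNREAD':
        pass                            # leave unread, a human should answer
    elif response == '__READ':
        delegate.mark_as_read(peer_id)  # no answer needed
    else:
        delegate.send_response(peer_id, response)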

twoeightnine, 2020