Documentation retrieval system to help LLMs navigate less-popular (yet often more powerful) Python libraries
APACHE-2.0 License
MongooseMiner is a search system that pushes LLM-based code generation beyond average human performance. Most LLMs for code generation write code like humans.
By evaluating the documentation strings of the most common PyPI projects and retrieving them as needed to guide LLM autocompletion, MongooseMiner can deliver the most appropriate and performant code.
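The retrieval idea can be illustrated with a small sketch. This is not MongooseMiner's implementation: the scoring is a toy token-overlap measure standing in for a real retriever, and `collect_docstrings`, `tokenize`, and `retrieve` are hypothetical names chosen for illustration. It only relies on the Python standard library.

```python
import inspect
import json
import re
from types import ModuleType


def collect_docstrings(module: ModuleType) -> dict[str, str]:
    """Map each public function or class of `module` to its docstring."""
    docs = {}
    for name, obj in inspect.getmembers(module):
        if name.startswith("_"):
            continue  # skip private and dunder members
        if inspect.isfunction(obj) or inspect.isclass(obj) or inspect.isbuiltin(obj):
            doc = inspect.getdoc(obj)
            if doc:
                docs[f"{module.__name__}.{name}"] = doc
    return docs


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(query: str, docs: dict[str, str], top_k: int = 3) -> list[str]:
    """Return the names whose docstrings share the most tokens with the query.

    Assumption: plain token overlap is only a placeholder for whatever
    ranking a production system would use.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        docs.items(),
        key=lambda item: len(query_tokens & tokenize(item[1])),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Usage: index the docstrings of a standard-library module and query them.
docs = collect_docstrings(json)
print(retrieve("serialize obj to a JSON formatted str", docs, top_k=3))
```

The retrieved docstrings could then be placed in the prompt so the model sees the relevant API before it completes the code.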
To enable MongooseMiner, we needed both PyPI and GitHub data. BigQuery hosts both:
- `distribution_metadata` table contains the columns we need to fetch:
  - `name`: mapped to `pypi_name`
  - `version`: mapped to `pypi_version`
  - `summary` & `description`: combined into a single `pypi_description` string
  - `home_page` (string), `download_url` (string) & `project_urls` (array of strings): where we can find the source code links and check whether they lead to GitHub; exported to `github_url`
  - `requires`: for dependencies
- `file_downloads` table contains columns:
  - `project`: like `a8`
- `sample_repos` table contains:
  - `repo_name`: string like `FreeCodeCamp/FreeCodeCamp`
  - `watch_count`: integer for the number of people watching the repo
- `languages` table contains:
  - `repo_name`: string like `FreeCodeCamp/FreeCodeCamp`
  - `language.name`: string like `C`
  - `language.bytes`: integer containing the amount of code written in that language

We use that data to aggregate information into one table.
For details and the code, see `bigquery.sql`.
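The column mapping described for `distribution_metadata` can be sketched in Python as follows. This is an illustration only; the actual logic lives in `bigquery.sql` and may differ. The helper names `find_github_url` and `to_package_row`, the regular expression, and the choice to join `summary` and `description` with a blank line are all assumptions made for this sketch.

```python
import re
from typing import Iterable, Optional

# Matches https://github.com/<owner>/<repo> and ignores anything after the repo name.
GITHUB_REPO = re.compile(
    r"https?://(?:www\.)?github\.com/([\w.-]+)/([\w.-]+)", re.IGNORECASE
)


def find_github_url(
    home_page: Optional[str],
    download_url: Optional[str],
    project_urls: Optional[Iterable[str]],
) -> Optional[str]:
    """Return the first GitHub repository URL found in the three fields, if any."""
    candidates = [home_page, download_url, *(project_urls or [])]
    for text in candidates:
        match = GITHUB_REPO.search(text or "")
        if match:
            owner, repo = match.groups()
            # Normalise clone-style URLs such as .../repo.git
            return f"https://github.com/{owner}/{repo.removesuffix('.git')}"
    return None


def to_package_row(meta: dict) -> dict:
    """Map one distribution_metadata record onto the aggregated column names."""
    parts = [meta.get("summary"), meta.get("description")]
    return {
        "pypi_name": meta["name"],
        "pypi_version": meta["version"],
        # Assumption: summary and description are joined with a blank line.
        "pypi_description": "\n\n".join(p for p in parts if p),
        "github_url": find_github_url(
            meta.get("home_page"),
            meta.get("download_url"),
            meta.get("project_urls"),
        ),
    }
```

A record whose `project_urls` contains an entry such as `"Source, https://github.com/<owner>/<repo>/"` would yield `https://github.com/<owner>/<repo>` as its `github_url`, while a record with no GitHub link in any of the three fields yields `None`.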