Experimental GitHub project analysis based on GHTorrent
The goal of this project is to explore the use of GHTorrent to better understand the life of Open Source software projects.
It is in its initial conception phase, so it is probably only useful for exchanging ideas and early prototypes.
Don't hesitate to open GitHub issues if you run into any kind of problem or want to give feedback or exchange thoughts.
A machine with 16 GB of RAM, 4 cores, and 1 TB of disk is needed to run everything without issues.
You also need a running Elasticsearch and Kibana.
And you will need around 12 hours of compute time (less than 1 hour of human time) to execute all the processes.
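To check that Elasticsearch is up before starting, you can query it directly (a minimal check, assuming the default local URL; adjust it to your deployment):

curl -s http://localhost:9200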
The first step is to download the GHTorrent data. The testing has been done with mysql-2018-09-01.tar.gz.
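For example, assuming the dump is still published on the GHTorrent downloads site (check ghtorrent.org for the current location and exact file name):

wget http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2018-09-01.tar.gz
tar xzf mysql-2018-09-01.tar.gz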
There are two analyses: the first one loads only the projects table, and the second one loads projects, users, and commits.
To load the GitHub projects data into Elasticsearch:
ght-restore-mysql-projects
ght_projects2es.py
ght_projects2es.py -e <elastic_url> -i ghtprojects --db-name ghtorrent_projects
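For example, against a local Elasticsearch instance (the URL here is just an assumption; use the one of your own deployment):

ght_projects2es.py -e http://localhost:9200 -i ghtprojects --db-name ghtorrent_projects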
You will need around 4 hours to complete the above steps.
The language detection is based on https://github.com/github/linguist
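For reference, linguist can also be run locally on a cloned repository to see its language breakdown (a sketch, assuming Ruby and the gem are installed; the repository path is hypothetical):

gem install github-linguist
github-linguist /path/to/cloned/repo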
To reduce the amount of data to load, only the commits from 2018 are used. Extract them with:
grep '"2018-' commits.csv > commits-2018.csv
You must load this commits-2018.csv file, which is around 22 GB, instead of the full commits.csv, which is around 100 GB.
To load the GitHub commits data into Elasticsearch:
ght_commits2es.py
ght_commits2es.py -e <elastic_url> -i ghtcommits --db-name ghtorrent_commits
You will need around 12 hours to complete the above steps.
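Once everything finishes, you can sanity-check the number of documents loaded into each index, for example (assuming a local Elasticsearch and the index names used above):

curl -s 'http://localhost:9200/ghtprojects/_count'
curl -s 'http://localhost:9200/ghtcommits/_count'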