Experimental GitHub project analysis based on GHTorrent
The goal of this project is to explore the use of GHTorrent to better understand the life of Open Source software projects.
It is in its initial conception phase, so it is probably only useful for exchanging ideas and early prototypes.
Don't hesitate to open GitHub issues if you run into any kind of problem or want to give feedback or exchange thoughts.
A machine with 16 GB of RAM, 4 cores, and 1 TB of disk is needed to run everything without issues.
You also need a running Elasticsearch and Kibana.
And you will need around 12 hours of compute time (less than 1 hour of human time) to execute all the processes.
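To check that Elasticsearch is up before starting, you can query it directly (a minimal check, assuming the default local URL; adjust it to your deployment):

curl -s http://localhost:9200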
The first step is to download the GHTorrent data. The testing has been done with mysql-2018-09-01.tar.gz.
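For example, assuming the dump is still published on the GHTorrent downloads site (check ghtorrent.org for the current location and exact file name):

wget http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2018-09-01.tar.gz
tar xzf mysql-2018-09-01.tar.gz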
There are two analyses: the first one loads only the projects table, and the second one loads projects, users, and commits.
To load the GitHub projects data into Elasticsearch:
ght-restore-mysql-projects
ght_projects2es.py
ght_projects2es.py -e <elastic_url> -i ghtprojects --db-name ghtorrent_projects
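For example, against a local Elasticsearch instance (the URL here is just an assumption; use the one of your own deployment):

ght_projects2es.py -e http://localhost:9200 -i ghtprojects --db-name ghtorrent_projects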
You will need around 4 hours to complete the above steps.
The language detection is based on https://github.com/github/linguist
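For reference, linguist can also be run locally on a cloned repository to see its language breakdown (a sketch, assuming Ruby and the gem are installed; the repository path is hypothetical):

gem install github-linguist
github-linguist /path/to/cloned/repo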
To reduce the amount of data to load, only the commits from 2018 are used. Extract them with:
grep '"2018-' commits.csv > commits-2018.csv
You must load this commits-2018.csv file, which is around 22 GB, instead of the full commits.csv, which is around 100 GB.
To load the GitHub commits data into Elasticsearch:
ght_commits2es.py
ght_commits2es.py -e <elastic_url> -i ghtcommits --db-name ghtorrent_commits
You will need around 12 hours to complete the above steps.
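Once everything finishes, you can sanity-check the number of documents loaded into each index, for example (assuming a local Elasticsearch and the index names used above):

curl -s 'http://localhost:9200/ghtprojects/_count'
curl -s 'http://localhost:9200/ghtcommits/_count'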