Github Metadata Analytics

Points of Inquiry

In particular, we aim to address the following questions:

Folder Structure

The following is the implied folder structure:

.
 DataFiles
    dataset.csv
    test.csv
    train-val.csv
    train.csv
    val.csv
 DataPreparation
    DataPreperation.ipynb
    Preprocess.py
    Visualize.py
 Questions
    D - Language Commonality
       LanguageCommonality.ipynb
       Logic.py
    D - Language Success
       LanguageSuccess.ipynb
       Logic.py
    D - License Prevalence
       LicensePrevalence.ipynb
       Logic.py
    D - Size & Contribution Effect
       Logic.py
       SizeAndContributionEffect.ipynb
    E - Arbitrary Language Predictors
       ArbitraryLanguagePredictors.ipynb
       Logic.py
    E - Contributions & Watchers
       Contributions & Watchers.ipynb
       Logic.py
    E - Database & Frameworks Correspondence
       Databases&Frameworks.ipynb
       Logic.py
    E - Language Associations
       Language Associations.ipynb
       Logic.py
    E - Licenses, Language & Size
       Licenses, Language & Size.ipynb
       Logic.py
    I - Generalizing Archival Trends
       Generalizing Archival Trends.ipynb
       Logic.py
    I - Generalizing Dynamic Typed Languages
       Generalizing Dynamic Typed Languages.ipynb
       Logic.py
    P - Expected Language Archivals
       ExpectedLanguageArchivals.ipynb
       Logic.py
       langs.txt
    P - Expected Python Contributions
        Expected Python Contributions.ipynb
        Logic.py
 README.md
 LICENSE
 Reports & Dashboard
 DS Project.pdf
 DS Proposal.pdf
 script.py
 utils.py

Standards

We set the following working standards while undertaking the project. If you wish to contribute, please respect them.

Pipeline

We followed the data science cycle for each question, including the epicycle that applies within each stage. As set out in the standards, each question's notebook is structured into the 5 stages of the cycle, and the epicycle iterations for each stage are logged in a table under that stage in the notebook.

To avoid repeating work across questions, we also used a single data preparation stage that covers most of the processing the questions have in common.

Running the Project

pip install -r requirements.txt
# To explore the cycle for any question, simply head to its notebook.

In the rest of the README, we will explore the data preparation stage, the cycle for three of the questions, and some dashboards!

Data Preparation

Our dataset is Kaggle's GitHub Metadata dataset. An observation from the dataset is shown below:

{
  "owner": "pelmers",
  "name": "text-rewriter",
  "stars": 13,
  "forks": 5,
  "watchers": 4,
  "isFork": false,
  "isArchived": false,
  "languages": [ { "name": "JavaScript", "size": 21769 }, { "name": "HTML", "size": 2096 }, { "name": "CSS", "size": 2081 } ],
  "languageCount": 3,
  "topics": [ { "name": "chrome-extension", "stars": 43211 } ],
  "topicCount": 1,
  "diskUsageKb": 75,
  "pullRequests": 4,
  "issues": 12,
  "description": "Webextension to rewrite phrases in pages",
  "primaryLanguage": "JavaScript",
  "createdAt": "2015-03-14T22:35:11Z",
  "pushedAt": "2022-02-11T14:26:00Z",
  "defaultBranchCommitCount": 54,
  "license": null,
  "assignableUserCount": 1,
  "codeOfConduct": null,
  "forkingAllowed": true,
  "nameWithOwner": "pelmers/text-rewriter",
  "parent": null
}

Our data preparation module was used for all questions and supported the following:

Reading specific splits of the data (train, test, val)
Reading specific columns of the data (by name or type)
Breaking down composite columns
Deleting useless columns
Handling missing values
Handling outliers by multiple imputation
Extracting time features
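
As a rough illustration of two of the steps above (breaking down composite columns and extracting time features), here is a minimal sketch that uses only the Python standard library on a trimmed copy of the sample observation. The function names `flatten_languages` and `extract_time_features` and the output keys are hypothetical; the project's actual logic lives in Preprocess.py and may differ.

```python
from datetime import datetime

# A trimmed copy of the sample observation shown earlier.
record = {
    "languages": [
        {"name": "JavaScript", "size": 21769},
        {"name": "HTML", "size": 2096},
        {"name": "CSS", "size": 2081},
    ],
    "createdAt": "2015-03-14T22:35:11Z",
    "pushedAt": "2022-02-11T14:26:00Z",
}


def flatten_languages(rec):
    """Break the composite `languages` column into plain lists and a total size."""
    names = [lang["name"] for lang in rec["languages"]]
    sizes = [lang["size"] for lang in rec["languages"]]
    return {"languageNames": names, "languageSizes": sizes, "totalSize": sum(sizes)}


def extract_time_features(rec):
    """Derive simple time features from the ISO 8601 timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(rec["createdAt"], fmt)
    pushed = datetime.strptime(rec["pushedAt"], fmt)
    return {
        "createdYear": created.year,
        "createdMonth": created.month,
        "daysActive": (pushed - created).days,  # whole days between creation and last push
    }


features = {**flatten_languages(record), **extract_time_features(record)}
print(features)
```

Keeping each step as a small function that returns new keys makes it easy to apply only the steps a given question needs.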

Basic Feature Analysis

After handling missing values using domain knowledge and applying multiple imputation with stochastic gradient descent, we obtain the following violin plots:

A myriad of other plots, statistics, and insights for each feature, along with epicycle logging, can be found in the demonstration notebook, which, like all of GitHub, is best viewed in dark mode.

Now let's take a cursory look at some of the questions. Epicycle details and in-depth insights can be found in the corresponding notebook or the report.

License Prevalence

Stating Questions

What is the fraction of repositories without a license?
What fraction of those with licenses also have a code of conduct?
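
Both questions reduce to simple fractions over the `license` and `codeOfConduct` fields. Below is a minimal sketch, assuming a missing value appears as `None` (as `null` does in the sample observation) and using made-up toy records; the helper `license_fractions` is hypothetical and not taken from the project's Logic.py.

```python
def license_fractions(repos):
    """Return (fraction without a license, fraction of licensed repos with a code of conduct)."""
    total = len(repos)
    licensed = [r for r in repos if r.get("license") is not None]
    unlicensed_fraction = (total - len(licensed)) / total if total else 0.0
    with_conduct = [r for r in licensed if r.get("codeOfConduct") is not None]
    conduct_fraction = len(with_conduct) / len(licensed) if licensed else 0.0
    return unlicensed_fraction, conduct_fraction


# Toy records for illustration only; these are not from the dataset.
toy_repos = [
    {"license": None, "codeOfConduct": None},
    {"license": "MIT License", "codeOfConduct": None},
    {"license": "MIT License", "codeOfConduct": "Contributor Covenant"},
    {"license": "Apache License 2.0", "codeOfConduct": None},
]

no_license, conduct_given_license = license_fractions(toy_repos)
print(f"Without a license: {no_license:.2f}")
print(f"Licensed with a code of conduct: {conduct_given_license:.2f}")
```

Note that the second fraction is conditional: its denominator is the number of licensed repositories, not the total.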

Exploratory Data Analytics

The Available Licenses

Top 10 Licenses

Model Building

Result Interpretation & Communicating Results

Arbitrary Language Predictors

Stating Questions

Is there any association between the number of main branch commits, stars, and pull requests and the primary programming language used in a project?

Exploratory Data Analytics

There seems to be no precise distinctive association

Not even from a distribution perspective

Let's rather look for a high-level association

Model Building

Based on EDA for high-level association, we make the following claims:

I. Stars don't really differ from language to language

II. TypeScript can be regarded as the most active language

III. C can be regarded as the least collaborative language

Check whether CLT holds before proceeding with hypothesis testing

Test Claim I

Test Claim II

Test Claim III

Language	Result
JavaScript	Accept: (C) - (JavaScript) < 0
Python	Accept: (C) - (Python) < 0
Undetected	Accept: (C) - (Undetected) > 0
Java	Accept: (C) - (Java) < 0
C++	Cannot Reject: (C) - (C++) > 0
TypeScript	Accept: (C) - (TypeScript) < 0
PHP	Accept: (C) - (PHP) < 0
C#	Accept: (C) - (C#) < 0
HTML	Accept: (C) - (HTML) < 0
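
The results above read as one-sided comparisons between the mean for C and the mean for another language. The source does not name the exact test, so the following is only a sketch of one common choice: a large-sample two-sample z-test, which is reasonable when the CLT holds. The function name, the significance level, and all numbers are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def one_sided_z_test(sample_a, sample_b):
    """Test H1: mean(a) - mean(b) < 0 with a large-sample z-test.

    Returns the z statistic and the one-sided (left-tail) p-value.
    """
    diff = mean(sample_a) - mean(sample_b)
    std_err = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    z = diff / std_err
    p_value = NormalDist().cdf(z)  # probability of a z at least this negative under H0
    return z, p_value


# Made-up pull request counts, for illustration only.
c_pull_requests = [1, 2, 0, 3, 1, 2, 1, 0, 2, 1] * 5
other_pull_requests = [4, 6, 3, 5, 7, 4, 6, 5, 3, 6] * 5

z, p = one_sided_z_test(c_pull_requests, other_pull_requests)
alpha = 0.05  # assumed significance level
print(f"z = {z:.2f}, p = {p:.4f}")
print("Accept: (C) - (other) < 0" if p < alpha else "Cannot reject the null hypothesis")
```

Repeating the call once per language, with C as the first sample, would yield one row per language as in the results above.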

Result Interpretation & Communicating Results

Insights
We cannot predict the language given stars, pull requests, and commits; in other words, there is no strong or precise association
Languages seem to be equally successful as their stars are not significantly different on average
There is a high-level association for commits; for instance, TypeScript can be regarded as the most active language
There is a high-level association for pull requests; for instance, C (& C++) can be regarded as the least collaborative language

Expected Language Archivals

Stating Questions

What programming language is expected to have the most repos archived in 2023?

Exploratory Data Analytics

Archival Rate for Most Archived Language Every Year

Monthly Archivals for C

C does not seem to be leaving us soon.

Model Building

Training a Time-series Forecasting Model per Language

Predicting for 2023 for each Language
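
To make the per-language setup concrete, here is a minimal sketch that fits one model per language and compares the resulting forecasts for 2023. A plain least-squares linear trend on yearly counts stands in for the forecasting model, which the source does not specify; the language names and all counts are made up for illustration.

```python
def fit_linear_trend(years, counts):
    """Ordinary least squares fit of counts = slope * year + intercept."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(model, year):
    """Evaluate the fitted trend at a given year, clipped at zero."""
    slope, intercept = model
    return max(0.0, slope * year + intercept)  # archival counts cannot be negative


# Made-up yearly archival counts per language (not from the dataset).
years = [2018, 2019, 2020, 2021, 2022]
archivals = {
    "LanguageA": [10, 12, 14, 16, 18],
    "LanguageB": [30, 25, 20, 15, 10],
}

# One independent model per language, then one forecast per model.
models = {lang: fit_linear_trend(years, counts) for lang, counts in archivals.items()}
predictions = {lang: forecast(model, 2023) for lang, model in models.items()}
most_archived = max(predictions, key=predictions.get)
print(predictions, most_archived)
```

Training a separate model per language lets each one follow its own trend, at the cost of sharing no information across languages.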

Result Interpretation & Communicating Results

Insights
Assembly is expected to be the most archived language in 2023. It makes sense as it's one of the oldest languages around
Different languages seem to follow trends of different complexity. The trend is mostly decreasing for modern popular languages but stochastic for older ones
C++ and C seem to be safer than expected, which can be justified by their use in embedded systems, operating systems, and libraries for other languages
Niche languages like HCL and Solidity don't seem to be at risk, but they probably took a big hit earlier
The endangerment of languages like Ruby and Lua is expected. Lua has recently been listed in a popular list of the worst languages, and Ruby stopped shining after the rise of Python and JavaScript

This demos just 3 of the 13 questions; check the notebooks and the report for more!

Some Dashboards

Collaborators
