CMP23 Data Science Project Repository
MIT License
In particular, we aim to address the following set of questions
The following is the implied folder structure:
.
DataFiles
    dataset.csv
    test.csv
    train-val.csv
    train.csv
    val.csv
DataPreparation
    DataPreperation.ipynb
    Preprocess.py
    Visualize.py
Questions
    D - Language Commonality
        LanguageCommonality.ipynb
        Logic.py
    D - Language Success
        LanguageSuccess.ipynb
        Logic.py
    D - License Prevalence
        LicensePrevalence.ipynb
        Logic.py
    D - Size & Contribution Effect
        Logic.py
        SizeAndContributionEffect.ipynb
    E - Arbitrary Language Predictors
        ArbitraryLanguagePredictors.ipynb
        Logic.py
    E - Contributions & Watchers
        Contributions & Watchers.ipynb
        Logic.py
    E - Database & Frameworks Correspondence
        Databases&Frameworks.ipynb
        Logic.py
    E - Language Associations
        Language Associations.ipynb
        Logic.py
    E - Licenses, Language & Size
        Licenses, Language & Size.ipynb
        Logic.py
    I - Generalizing Archival Trends
        Generalizing Archival Trends.ipynb
        Logic.py
    I - Generalizing Dynamic Typed Languages
        Generalizing Dynamic Typed Languages.ipynb
        Logic.py
    P - Expected Language Archivals
        ExpectedLanguageArchivals.ipynb
        Logic.py
        langs.txt
    P - Expected Python Contributions
        Expected Python Contributions.ipynb
        Logic.py
README.md
LICENSE
Reports & Dashboard
    DS Project.pdf
    DS Proposal.pdf
script.py
utils.py
We set the following working standards while undertaking the project. If you wish to contribute, please respect them.
We followed the data science cycle for each of the questions, including the epicycle that applies at every stage. As per the standards, each notebook corresponding to a question is structured into the 5 stages of the cycle. We also logged our epicycle iterations for each stage in a table under that stage in the notebook.
To optimize the cycle over different questions, we also employed a single data preparation stage to include most of the common required processing over different questions.
pip install -r requirements.txt
# To explore the cycle for any question, simply head to its notebook.
In the rest of the README, we will explore the data preparation stage, the cycle for 2 or 3 questions and some dashboards!
Our dataset was Kaggle's GitHub Metadata dataset. An observation from the dataset is shown below:
{
"owner": "pelmers",
"name": "text-rewriter",
"stars": 13,
"forks": 5,
"watchers": 4,
"isFork": false,
"isArchived": false,
"languages": [ { "name": "JavaScript", "size": 21769 }, { "name": "HTML", "size": 2096 }, { "name": "CSS", "size": 2081 } ],
"languageCount": 3,
"topics": [ { "name": "chrome-extension", "stars": 43211 } ],
"topicCount": 1,
"diskUsageKb": 75,
"pullRequests": 4,
"issues": 12,
"description": "Webextension to rewrite phrases in pages",
"primaryLanguage": "JavaScript",
"createdAt": "2015-03-14T22:35:11Z",
"pushedAt": "2022-02-11T14:26:00Z",
"defaultBranchCommitCount": 54,
"license": null,
"assignableUserCount": 1,
"codeOfConduct": null,
"forkingAllowed": true,
"nameWithOwner": "pelmers/text-rewriter",
"parent": null
}
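Because each observation nests lists such as `languages` and `topics`, it usually has to be flattened into a single row before analysis. The snippet below is a minimal sketch of one way to do that, not the project's actual preprocessing code: the `flatten` helper and the output column names (`hasLicense`, `hasCodeOfConduct`, `topicNames`, `share_<language>`) are hypothetical and chosen only for illustration.

```python
import json

# A trimmed copy of the observation above (only the fields used below).
raw = """
{
  "nameWithOwner": "pelmers/text-rewriter",
  "stars": 13,
  "languages": [
    {"name": "JavaScript", "size": 21769},
    {"name": "HTML", "size": 2096},
    {"name": "CSS", "size": 2081}
  ],
  "topics": [{"name": "chrome-extension", "stars": 43211}],
  "license": null,
  "codeOfConduct": null
}
"""


def flatten(repo: dict) -> dict:
    """Turn one nested record into a flat row (hypothetical column names)."""
    total = sum(lang["size"] for lang in repo["languages"])
    row = {
        "nameWithOwner": repo["nameWithOwner"],
        "stars": repo["stars"],
        # JSON null becomes None, so these flags mark missing values.
        "hasLicense": repo["license"] is not None,
        "hasCodeOfConduct": repo["codeOfConduct"] is not None,
        "topicNames": [topic["name"] for topic in repo["topics"]],
    }
    # Share of the code base written in each language, by size.
    for lang in repo["languages"]:
        row[f"share_{lang['name']}"] = lang["size"] / total if total else 0.0
    return row


row = flatten(json.loads(raw))
for key, value in row.items():
    print(f"{key}: {value}")
```

Keeping the per-language shares as separate columns is one design choice among several; a long format with one row per (repository, language) pair would work equally well for grouped analyses.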
Our data preparation module was used for all questions and supported the following:
After using domain knowledge to handle missing values and using multiple imputation with stochastic gradient descent, we obtain the following violin plots
A myriad of other plots, statistics, and insights, along with the epicycle logging, are present in the demonstration notebook, which, like everything on GitHub, is best viewed in dark mode.
Now let's take a cursory glance at some of the questions. Note that epicycle details and in-depth insights are found in the corresponding notebook or the report.
What is the fraction of repositories without a license?
What fraction of those with licenses also have a code of conduct?
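Both questions reduce to simple proportions, where the second one is conditional on the repository having a license. The following is a minimal sketch under the assumption that a missing license or code of conduct is stored as `None` (as in the `null` fields of the observation shown earlier); the `license_fractions` helper and the toy records are made up for illustration and are not results from the dataset.

```python
def license_fractions(repos):
    """Return (fraction without a license,
    fraction of licensed repos that also have a code of conduct)."""
    n = len(repos)
    licensed = [r for r in repos if r.get("license") is not None]
    no_license = (n - len(licensed)) / n if n else float("nan")
    with_coc = sum(r.get("codeOfConduct") is not None for r in licensed)
    coc_given_license = with_coc / len(licensed) if licensed else float("nan")
    return no_license, coc_given_license


# Made-up toy records, NOT taken from the dataset.
toy = [
    {"license": None, "codeOfConduct": None},
    {"license": "MIT License", "codeOfConduct": None},
    {"license": "Apache License 2.0", "codeOfConduct": "Contributor Covenant"},
    {"license": None, "codeOfConduct": None},
]

no_license, coc_given_license = license_fractions(toy)
print(no_license, coc_given_license)  # 0.5 0.5
```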
Is there any association between the number of main branch commits, stars and pull requests, and the primary programming language used in a project?
Based on EDA for high-level association, we make the following claims:
I. Stars don't really differ from language to language
II. TypeScript can be regarded as the most active language
III. C can be regarded as the least collaborative language
JavaScript | Python | Undetected | Java | C++ | TypeScript | PHP | C# | HTML |
---|---|---|---|---|---|---|---|---|
Accept: (C) - (JavaScript) < 0 | Accept: (C) - (Python) < 0 | Accept: (C) - (Undetected) > 0 | Accept: (C) - (Java) < 0 | Cannot Reject: (C) - (C++) > 0 | Accept: (C) - (TypeScript) < 0 | Accept: (C) - (PHP) < 0 | Accept: (C) - (C#) < 0 | Accept: (C) - (HTML) < 0 |
Insights |
---|
We cannot predict the language given stars, pull requests, and commits. In other words, there is no strong or precise association |
Languages seem to be equally successful as their stars are not significantly different on average |
There is a high-level association for commits; for instance, TypeScript can be regarded as the most active language |
There is a high-level association for pull requests; for instance, C (& C++) can be regarded as the least collaborative language |
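
The pairwise comparisons tabulated above are one-sided statements about a difference between C and another language. This README does not say which statistical test produced them, so the sketch below swaps in a one-sided permutation test on the difference in means purely as an illustration. The function name, the resample count, and the pull-request counts are all assumptions and made-up values, not the project's method or data.

```python
import random
from statistics import mean


def one_sided_permutation_test(a, b, n_resamples=2000, seed=0):
    """Estimate the p-value for H1: mean(a) - mean(b) < 0 by shuffling group labels."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        # Count relabelings at least as extreme as the observed difference.
        if diff <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)


# Made-up pull-request counts, NOT taken from the dataset.
c_prs = [0, 1, 0, 2, 1, 0, 1, 3, 0, 1]
js_prs = [4, 6, 3, 8, 5, 7, 2, 9, 4, 6]

observed, p_value = one_sided_permutation_test(c_prs, js_prs)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

A small p-value here would support the claim that the first group has the lower mean, which corresponds to the "(C) - (Language) < 0" entries in the table.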
What programming language is expected to have the most repos archived in 2023?
Does not seem to be leaving us soon.
Insights |
---|
Assembly is expected to be the most archived language in 2023. It makes sense as it's one of the oldest languages around |
Different languages seem to follow trends of different complexity. The trend is mostly decreasing for modern popular languages but stochastic for older ones |
C++ and C seem to be safer than expected, which can be justified by their use in embedded systems, operating systems, and libraries for other languages |
Niche languages like HCL and Solidity don't seem to be at risk, but they probably took a big hit earlier |
The endangerment of languages like Ruby and Lua is expected. Lua has recently been listed in a popular list of the worst languages, and Ruby stopped shining after the rise of Python and JavaScript |
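
To make the idea of an expected archival count concrete, the sketch below fits a straight line to yearly archival counts per language and extrapolates one year ahead. This is a deliberate simplification: as noted above, different languages follow trends of different complexity, and this README does not say which model the notebook uses. The helper names, the placeholder languages `LangA` and `LangB`, and every count are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def forecast(counts_by_year, target_year):
    """Extrapolate a linear trend to target_year, clipped at zero."""
    years = sorted(counts_by_year)
    slope, intercept = fit_line(years, [counts_by_year[y] for y in years])
    return max(0.0, slope * target_year + intercept)


# Made-up yearly archival counts for placeholder languages, NOT real data.
history = {
    "LangA": {2019: 10, 2020: 12, 2021: 14, 2022: 16},
    "LangB": {2019: 30, 2020: 24, 2021: 18, 2022: 12},
}

predicted = {lang: forecast(counts, 2023) for lang, counts in history.items()}
most_archived = max(predicted, key=predicted.get)
print(predicted, most_archived)
```

Note that the language with the larger historical counts (`LangB`) is not the one with the larger forecast, since its trend is decreasing; this is why the trend, rather than the latest count, drives the prediction.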