Trending Programming Languages ranked by GitHub Users
GPL-3.0 License
This project finds trends in popularity amongst programming languages, by analyzing over 1.25 billion events from the public GitHub timeline and figuring out how many users each language has. See this blog post for an overview and discussion of the trends.
As of April 10, 2018, the rankings of each language by the number of active users on GitHub are:
The main data sources for this project are:
Analyzing programming language trends requires figuring out the language for each repository.
There are two ways to get language information out of the GitHub API. The first is to query for the repository using the `GET /repos/:owner/:repo` endpoint, which returns the dominant language of the repo. The second is to get the byte breakdown of all the languages using the `GET /repos/:owner/:repo/languages` endpoint, which returns the number of bytes used by each language in the repo.
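As a rough illustration of the two endpoints, here is a minimal Python sketch using only the standard library. The helper names (`fetch_json`, `repo_language`, `language_bytes`, `largest_language`) are made up for this example and are not part of this project's code. Requests are unauthenticated, so they are subject to GitHub's rate limits. `largest_language` simply picks the biggest byte count, which is an assumption for illustration and not necessarily how GitHub decides the dominant language.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_json(path):
    """GET a GitHub REST API path and decode the JSON body (unauthenticated)."""
    request = urllib.request.Request(
        API_ROOT + path,
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def repo_language(owner, repo):
    """Dominant language as reported by GET /repos/:owner/:repo."""
    return fetch_json(f"/repos/{owner}/{repo}")["language"]


def language_bytes(owner, repo):
    """Per-language byte counts from GET /repos/:owner/:repo/languages."""
    return fetch_json(f"/repos/{owner}/{repo}/languages")


def largest_language(byte_counts):
    """Hypothetical helper: language with the most bytes, or None if empty."""
    if not byte_counts:
        return None
    return max(byte_counts, key=byte_counts.get)
```

Calling `repo_language("owner", "repo")` costs one request per repo, while `language_bytes` costs another, which is one practical reason to stick with the single dominant language when crawling tens of millions of repos.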
We're using the single dominant language for the repo for this analysis. While this loses some information, there are multiple benefits that make this much more practical:
So the plan to get the language for each repo is to aggregate four sources of information: the language scraped from the GitHub REST API, the language from the projects table of the GHTorrent project, the language included with certain GitHub Archive events, and finally the language inferred from fork events (forks are assumed to have the same language as the repo they are forking).
Source | Repo Count |
---|---|
GHTorrent | 42.8M |
scraper | 22.5M |
GitHub-Archive | 13.5M |
forks | 37M |
There is significant overlap between all these sources of information, but once aggregated and deduplicated we ended up with language information for 62.8 million repos. This includes every repo that has had more than 1 user interact with it, and all repos with only 1 user that have more than 5 total events. I'm still crawling the remaining repos.
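The aggregation step can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual merge script: the function names are hypothetical, and the priority order (earlier sources win when two sources disagree) is an assumption, since the real precedence between sources isn't spelled out here.

```python
def merge_repo_languages(*sources):
    """Merge {repo: language} mappings; earlier sources win on conflicts.

    Entries with no language (None or empty string) are skipped so that a
    later source can still fill them in.
    """
    merged = {}
    for source in sources:
        for repo, language in source.items():
            if language and repo not in merged:
                merged[repo] = language
    return merged


def inherit_from_forks(languages, forks):
    """Give each fork its parent's language when the fork has none.

    forks maps fork name -> parent name. Only parents already present in
    `languages` are used, so chains of forks are not resolved here.
    """
    result = dict(languages)
    for fork, parent in forks.items():
        if fork not in result and parent in languages:
            result[fork] = languages[parent]
    return result


# Made-up sample data, in an assumed priority order.
scraper = {"a/x": "Go"}
ghtorrent = {"a/x": "C", "b/y": "Python", "c/z": None}
archive = {"c/z": "Rust"}

known = merge_repo_languages(scraper, ghtorrent, archive)
everything = inherit_from_forks(known, {"d/x": "a/x"})
```

Deduplication falls out of using the repo name as the dictionary key: a repo that appears in several sources is only counted once.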
The dates here are the dates corresponding to the events in GitHub Archive. This means that we are analyzing when something was pushed to GitHub rather than the commit date (which could be considerably earlier). The reason behind this is that the commit date can be inaccurate: the dates are supplied by the developers, and it's not uncommon to see dates that are implausibly far back in the past (Jan 1, 1970) or, even more implausibly, in the future.
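To make the problem concrete, here is a small Python sketch of the kind of sanity check that commit dates would need if they were used instead. The function names and the exact cutoff are assumptions made for this example. The analysis itself sidesteps the issue by using the event timestamp.

```python
from datetime import datetime, timezone


def parse_utc(timestamp):
    """Parse an ISO 8601 UTC timestamp such as '2018-04-10T12:34:56Z'."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def commit_date_is_plausible(commit_time, push_time):
    """Hypothetical check: reject epoch-day dates and dates after the push."""
    # Anything on Jan 1, 1970 is treated as an unset clock (assumed cutoff).
    earliest = datetime(1970, 1, 2, tzinfo=timezone.utc)
    return earliest <= commit_time <= push_time


push = parse_utc("2018-04-10T12:00:00Z")
month_bucket = push.strftime("%Y-%m")  # the month this event is counted in
```

A commit dated in 1970 or in 2030 fails this check against a 2018 push, while one from the day before passes.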
This code requires both Go and Python to run properly. Additionally, this requires around 1TB of free disk space to run, and I would recommend at least 16GB of RAM.
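To configure your system to run this code:

Install the Python dependencies by running `pip install -r requirements.txt`, and the Go dependencies by running `go get ./...` from this directory.

Load the database schema into Postgres by running `psql github < schema.sql`.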
There are multiple different components to this code.
The main programs written in Go are:
gha-download-files
: Downloads new files from the GitHub Archive so that they can be analyzed locally.

gha-parse-events
: Parses the JSON events from the GitHub Archive and converts them to normalized TSV files. The JSON event schema has changed several times over the last 7 years, and normalizing to a consistent TSV schema makes it much easier to analyze.

gha-scraper
: Crawls repo information from the GitHub API and inserts it into Postgres.

There are also several small bash scripts that do the actual analysis:

scripts/calculate_language_mau.sh
: Joins the repo languages against the parsed events, and figures out the MAU for each language for every month.

scripts/calculate_repo_languages.sh
: Merges information from Postgres, GHTorrent, extracted GitHub Archive events, and fork events to get a single repo:language mapping.

scripts/calculate_top_repos.sh
: Ranks each repository by the number of users. The output of this is passed to gha-scraper to crawl repositories.

Finally, plotting is done with Python by running `python scripts/plot.py`. This will also update the graphs in this README.
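The core of the MAU calculation is a join between events and repo languages, followed by a distinct count of users per month and language. The Python sketch below shows that idea on a tiny in-memory example. It is not the actual bash script, and the three-column TSV layout (month, user, repo) is an assumption made for illustration. The real normalized schema may differ.

```python
import csv
import io
from collections import defaultdict


def language_mau(events_tsv, repo_languages):
    """Count distinct active users per (month, language).

    events_tsv: file-like object of tab-separated (month, user, repo) rows.
    repo_languages: dict mapping repo name -> language.
    Events on repos with no known language are ignored.
    """
    users = defaultdict(set)
    for month, user, repo in csv.reader(events_tsv, delimiter="\t"):
        language = repo_languages.get(repo)
        if language is not None:
            users[(month, language)].add(user)
    return {key: len(names) for key, names in users.items()}


def rank_languages(mau, month):
    """Languages for one month, most active users first (ties by name)."""
    counts = {lang: n for (m, lang), n in mau.items() if m == month}
    return sorted(counts, key=lambda lang: (-counts[lang], lang))


# Made-up events: alice appears twice on the same repo but counts once.
events = io.StringIO(
    "2018-04\talice\ta/x\n"
    "2018-04\talice\ta/x\n"
    "2018-04\tbob\ta/x\n"
    "2018-04\tbob\tb/y\n"
    "2018-04\tcarol\tz/unknown\n"
)
mau = language_mau(events, {"a/x": "Go", "b/y": "Python"})
ranking = rank_languages(mau, "2018-04")
```

Using a set per (month, language) pair is what makes this a count of users rather than a count of events, so a single very active user does not inflate a language's numbers.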
A future goal of this project is to simplify the steps needed to run this code; it's unnecessarily convoluted right now.