A dataset to extract container metadata from GitHub Dockerfiles (under development)
Here we will put together a dataset that attempts to extract Dockerfiles and associated metadata from GitHub repositories. Specifically, the metadata should include text from a README.md (or similar) that might describe the repository. We take the following steps:
We start with the script 0.find-github.py, which has sections that search for Dockerfiles and save the results for each organization to:
github/github-org-<orgname>.json
and github/github-org-<orgname>.pkl
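As a rough illustration of the two output files, here is a minimal sketch of writing one organization's results to both formats. The function name `save_org_results` and the structure of `results` are assumptions made for this example; they are not taken from 0.find-github.py.

```python
import json
import pickle
from pathlib import Path


def save_org_results(orgname, results, outdir="github"):
    """Write one organization's results as JSON and as a pickle.

    `results` is a hypothetical structure (here, a list of dicts); the
    real script may store something different.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    # File names follow the github-org-<orgname>.{json,pkl} pattern.
    json_path = outdir / f"github-org-{orgname}.json"
    pkl_path = outdir / f"github-org-{orgname}.pkl"

    with open(json_path, "w") as fh:
        json.dump(results, fh, indent=2)
    with open(pkl_path, "wb") as fh:
        pickle.dump(results, fh)

    return json_path, pkl_path
```

The JSON copy is easy to inspect by hand, while the pickle can be loaded straight back into Python.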
Since I only needed a reasonably sized subset, I stopped at index 7090 of the organizations list, by which point I had extracted lists of Dockerfiles for 1071 organizations. This is in addition to the first 1000 results returned by the general search.
To retrieve data for the repositories (meaning the Dockerfiles and metadata), I then parsed through the organization Dockerfile results and the original 1000 general results. For each repository, we create a subfolder under data that is organized by the lowercase first letter of the GitHub organization, followed by a folder hierarchy for the GitHub organization name and the repository name. Within each folder we save the Dockerfile(s) and a single text file with the combined README.md (and similar) text extracted from the repository. This step was performed by 1.github-extract.py.
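The folder layout described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `repo_folder` and `save_repo`, the output file name `README.txt`, and the choice to keep the organization name's original case below the first-letter folder are all hypothetical and may differ from what 1.github-extract.py does.

```python
from pathlib import Path


def repo_folder(org, repo, base="data"):
    """Return <base>/<lowercase first letter of org>/<org>/<repo>."""
    letter = org[0].lower()
    # Assumption: org and repo names are used as-is below the letter folder.
    return Path(base) / letter / org / repo


def save_repo(org, repo, dockerfiles, readmes, base="data"):
    """Save Dockerfile(s) and one combined README text file for a repository.

    `dockerfiles` maps a flat file name to its content, and `readmes` is a
    list of README-like texts; both structures are hypothetical.
    """
    folder = repo_folder(org, repo, base)
    folder.mkdir(parents=True, exist_ok=True)

    for name, content in dockerfiles.items():
        (folder / name).write_text(content)

    # Combine README.md (and similar) into a single text file.
    (folder / "README.txt").write_text("\n\n".join(readmes))
    return folder
```

For example, `repo_folder("Example-Org", "demo")` gives the path `data/e/Example-Org/demo`.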