
Hyper-unspecific data parsing and graphing scripts

A simple Python tool/framework for generating statistics about code in a git repository.

...What?!

I like making random graphs about random stats (code in this case), but got tired of constantly writing new hyper-specific scripts for parsing and graphing that data.

This project aims to solve that by not making any assumptions about the file structure or similar things, and not requiring code changes to collect more stats.

Thanks to @nico for the name :^)

Requirements

  • Python 3.11
  • braceexpand for nicer file glob patterns (pip install -r requirements.txt)
  • The following utilities need to be in PATH: git, scc (if used as an analyzer)

Usage

Create a config file (TOML):

repository = "/path/to/project"
output = "output/stats.json"

[analyzers]

[analyzers.lines]
type = "scc"

[analyzers.fixmes_and_todos]
type = "grep"
regex = "fixme|todo"
case_insensitive = true

Then call the main.py script with it:

python3 main.py path/to/config.toml

output/stats.json will be created, containing a JSON blob that looks something like this:

[
  {
    "commit": "c048cf2004",
    "timestamp": 1679655523.0,
    "analyzers": {
      "lines": {
        "AK/AllOf.h": {
          "language": "C Header",
          "bytes": 848,
          "lines": 37,
          "code": 25,
          "comment": 5,
          "blank": 7,
          "complexity": 1
        }
        // ...
      },
      "fixmes_and_todos": {
        "Some/File.cpp": 42
        // ...
      }
    }
  }
  // ...
]

That's the hard part done! You can now do whatever you want with this data; here are some ideas...
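As a starting point, the output is plain JSON, so a few lines of Python are enough to start poking at it. Here's a minimal sketch that, for every commit, sums the per-file counts of the fixmes_and_todos analyzer from the example config above:

import json

# Load the list of per-commit entries produced by main.py.
with open("output/stats.json") as f:
    stats = json.load(f)

# Each entry maps analyzer names to per-file results; the grep analyzer
# stores a plain match count per file, so summing the values gives a
# per-commit total.
for entry in stats:
    total = sum(entry["analyzers"]["fixmes_and_todos"].values())
    print(entry["commit"], total)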

Scripts

export_csv.py

Not everything likes JSON, but maybe CSV works! Takes all fields of each analyzer and puts them into individual files named <analyzer>.csv in the same directory as the input file.

$ python3 scripts/export_csv.py output/stats.json
$ ls output
fixmes_and_todos.csv lines.csv stats.json
$ cat output/fixmes_and_todos.csv | head -1
c048cf2004,2023-03-24 10:58:43,Some/File.cpp,42
$ cat output/lines.csv | head -1
c048cf2004,2023-03-24 10:58:43,AK/AllOf.h,C Header,848,37,25,5,7,1

This was written specifically with Postgres COPY in mind, and may or may not work for your use case.
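For example, loading one of the CSVs into Postgres could look roughly like this; the table name, schema, and connection string are assumptions based on the example output above, not something this project provides:

import psycopg2

# Placeholder connection string; adjust for your setup.
conn = psycopg2.connect("dbname=stats")
with conn, conn.cursor() as cur:
    # One column per CSV field: commit hash, timestamp, file path, match count.
    # ("commit" is a reserved word in SQL, hence commit_hash.)
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS fixmes_and_todos (
            commit_hash text,
            committed_at timestamp,
            file text,
            count integer
        )
        """
    )
    with open("output/fixmes_and_todos.csv") as f:
        cur.copy_expert("COPY fixmes_and_todos FROM STDIN WITH (FORMAT csv)", f)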

plot_grep_analyzer.py

Barebones matplotlib script to plot the sum of grep analyzer results per commit over time.

python3 scripts/plot_grep_analyzer.py output/stats.json fixmes_and_todos

plot_scc_analyzer.py

Matplotlib script to plot a summary of scc results. All lines of each category (blank, comment, etc.) are summed together across files.

python3 scripts/plot_scc_analyzer.py output/stats.json lines

Configuring the matplotlib plots

Since matplotlib is used by the plot_* scripts, the design of their plots can be controlled with a matplotlibrc file in the directory from which the script is invoked. For example, these parameters produce academic-looking plots (requires LaTeX):

text.usetex: True
font.size: 17
font.family: serif
axes.grid: True
axes.grid.axis: "y"
axes.grid.which: "both"

Config options

Top-level:

  • repository: Path to the input git repository. Required.
  • output: Path to the output JSON file. Required.
  • processes: Number of processes to use at the same time. Defaults to the number of CPU cores.
  • commit_sampling: Which commits to analyze. Defaults to all.
      • all: Don't skip any commits.
      • last: Only analyze the last commit, useful if you just want to generate stats for the current state.
      • daily: Only analyze the first commit of each day (in UTC), useful for large projects or if you're only interested in a general trend, not per-commit stats.

[cache]:

  • directory: Output directory for caching per-commit results. Massively speeds up subsequent runs. Caching is based on the current configuration, so changes to an analyzer's configuration will invalidate only its own cache. Defaults to none (no caching).

[logging]:

  • level: Python logging level. Defaults to INFO.

[analyzers.<name>]:

  • type: Type of this analyzer (see below). Required.
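Putting a few of these together, a config for a large repository might look something like this (a sketch, assuming commit_sampling takes the mode as a plain string; the paths are examples):

repository = "/path/to/project"
output = "output/stats.json"
commit_sampling = "daily"

[cache]
directory = "output/cache"

[logging]
level = "DEBUG"

[analyzers.lines]
type = "scc"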

Analyzers

grep

Count occurrences of strings or regular expressions across all files. Does not actually use grep(1).

  • files_glob: Glob pattern to run this analyzer on a subset of files. Defaults to none.
  • regex: Regular expression pattern to look for; make sure to escape it properly. Required.
  • case_insensitive: Whether to compile the pattern with re.IGNORECASE. Defaults to false.
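Conceptually, the matching boils down to something like this sketch (not the actual implementation, just the documented behavior of regex and case_insensitive):

import re

def count_matches(regex: str, text: str, case_insensitive: bool = False) -> int:
    # The pattern is compiled with re.IGNORECASE when case_insensitive is set.
    flags = re.IGNORECASE if case_insensitive else 0
    return len(re.findall(regex, text, flags))

count_matches("fixme|todo", "// TODO: refactor\n// FIXME\n", case_insensitive=True)  # 2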

grep_commits

Count occurrences of strings or regular expressions in commit messages.

  • regex: Regular expression pattern to look for; make sure to escape it properly. Required.
  • case_insensitive: Whether to compile the pattern with re.IGNORECASE. Defaults to false.
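For example, counting reverts in commit messages could look like this (the analyzer name and regex are just illustrations):

[analyzers.reverts]
type = "grep_commits"
regex = "revert"
case_insensitive = true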

scc

Uses scc(1) to generate per-file line count stats.

  • files_glob: Glob pattern to run this analyzer on a subset of files. Defaults to none.
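For example, limiting the line counts to C++ sources and headers (the glob itself is just an illustration; brace patterns work thanks to braceexpand):

[analyzers.lines]
type = "scc"
files_glob = "*.{cpp,h}"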

TODO

Tests, more analyzers.

License

MIT.