

PatchScope – A Modular Tool for Annotating and Analyzing Contributions

Annotates files and lines of diffs (patches) with their purpose and type, and performs statistical analysis on the generated annotation data.

Note: earlier in its history this project was called 'python-diff-annotator' rather than 'PatchScope', and the Python package was called 'diffannotator' rather than 'patchscope'. Some references to the older names remain, for example in directory names in some Jupyter Notebooks.

You can find an early version of the project documentation at https://ncusi.github.io/PatchScope/, but it is currently incomplete and unpolished.

Installation

Use the package manager pip to install patchscope.

To avoid dependency conflicts, it is strongly recommended to create a virtual environment first, activate it, and install patchscope into this environment. See also "Virtual environment" subsection below.

To install the most recent version, use

python -m pip install patchscope@git+https://github.com/ncusi/PatchScope#egg=main
or (assuming that you can clone the repository with SSH)
python -m pip install patchscope@git+ssh://git@github.com/ncusi/PatchScope.git#egg=main

Usage

Overview of tool components

This tool integrates four key components:

  1. extracting patches from a version control system or user-provided folders,
    either as a separate step with diff-generate, or integrated into the annotation step with diff-annotate
  2. applying specified annotation rules to selected patches
    using diff-annotate, which generates one JSON data file per patch
  3. generating configurable reports or summaries
    with various subcommands of diff-gather-stats; each summary is saved as a single JSON file
  4. advanced visualization with a web application (dashboard),
    which you can run with panel serve (see the description below)

Running scripts

This package installs scripts (currently three) that you can run to generate patches, annotate them, and extract their statistics. Every script name starts with the diff-* prefix.

Each script and subcommand supports the --help option.

  • diff-generate: generates patches (.patch and .diff files) from a given repository, in a format suitable for later analysis; not strictly necessary;

    Usage: diff-generate [OPTIONS] REPO_PATH [REVISION_RANGE...]
    (where REVISION_RANGE... is passed as arguments to the git log command)

  • diff-annotate: annotates an existing dataset (patch files in subdirectories), or annotates a selected subset of commits (of changes in commits) in the given repository;

    Usage: diff-annotate [OPTIONS] COMMAND [ARGS]...

    • diff-annotate patch [OPTIONS] PATCH_FILE RESULT_JSON: annotate a single PATCH_FILE, writing results to RESULT_JSON,
    • diff-annotate dataset [OPTIONS] DATASETS...: annotate all bugs in provided DATASETS,
    • diff-annotate from-repo [OPTIONS] REPO_PATH [REVISION_RANGE...]: create annotation data for commits from a local Git repository (with REVISION_RANGE... passed as arguments to the git log command);
  • diff-gather-stats: compute various statistics and metrics from patch annotation data generated by the diff-annotate script;

    Usage: diff-gather-stats [OPTIONS] COMMAND [ARGS]...

    • diff-gather-stats purpose-counter [--output JSON_FILE] DATASETS...: calculate count of purposes from all bugs in provided datasets,
    • diff-gather-stats purpose-per-file [OPTIONS] RESULT_JSON DATASETS...: calculate per-file count of purposes from all bugs in provided datasets,
    • diff-gather-stats lines-stats [OPTIONS] OUTPUT_FILE DATASETS...: calculate per-bug and per-file count of line types in provided datasets,
    • diff-gather-stats timeline [OPTIONS] OUTPUT_FILE DATASETS...: calculate timeline of bugs with per-bug count of different types of lines;
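
If you want to drive these scripts from Python (for example from a notebook), the usage lines above can be turned into argument lists. The following is only a minimal sketch: the helper names and all file paths are hypothetical, it uses only the positional arguments shown in the usage lines, and it does not assume any particular option.

```python
import shutil
import subprocess


def annotate_patch_cmd(patch_file, result_json):
    """Build argv for: diff-annotate patch PATCH_FILE RESULT_JSON"""
    return ["diff-annotate", "patch", str(patch_file), str(result_json)]


def timeline_cmd(output_file, datasets):
    """Build argv for: diff-gather-stats timeline OUTPUT_FILE DATASETS..."""
    return ["diff-gather-stats", "timeline", str(output_file), *map(str, datasets)]


def run_if_installed(argv):
    """Run argv only if its script is found on PATH; return True if it was run."""
    if shutil.which(argv[0]) is None:
        return False
    subprocess.run(argv, check=True)
    return True


# Hypothetical file names, used only for illustration
annotate_argv = annotate_patch_cmd("example.patch", "example.json")
timeline_argv = timeline_cmd("timeline.json", ["dataset_a", "dataset_b"])
print(" ".join(annotate_argv))
print(" ".join(timeline_argv))
```

Keeping the command as a list (rather than a single string) avoids shell quoting problems when paths contain spaces.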

You can find more information about the annotation process in "Annotation process" documentation.

Running web app (dashboard)

This package also includes a web dashboard, created using the Panel framework. To run it, you need to install additional dependencies, for example with pip install --editable .[web] (if you are running this project in editable mode).

Currently, it includes two web apps, namely Contributors Graph and Author Statistics. You can run each app with panel serve:

  • panel serve src/diffinsights_web/apps/contributors.py
  • panel serve src/diffinsights_web/apps/author.py

By default, this makes those apps available at http://localhost:5006/contributors and http://localhost:5006/author, respectively.

You can also run both of them at once with

  • panel serve src/diffinsights_web/apps/*.py
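
The URLs given above follow from the script names: each app is served under the file name without its .py extension, on port 5006 by default. A small sketch of that mapping (the function name panel_app_url is hypothetical, not part of this package or of Panel):

```python
from pathlib import PurePosixPath


def panel_app_url(script_path, host="localhost", port=5006):
    """Return the URL at which `panel serve script_path` exposes the app by default."""
    app_name = PurePosixPath(script_path).stem  # file name without ".py"
    return f"http://{host}:{port}/{app_name}"


for script in (
    "src/diffinsights_web/apps/contributors.py",
    "src/diffinsights_web/apps/author.py",
):
    print(panel_app_url(script))
```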

See the basic demo on Heroku:

You can find description of those two apps, with screenshots, at

Examples and demos

This repository also includes some examples demonstrating how this project works, and what it can be used for.

First time setup (for generating examples)

You can set up the environment for using this project, following the recommended practices (described in the "Development" section of this document), by running the examples-init.bash Bash script, and following its instructions.

Note that this script assumes that it is run on Linux, or a Linux-like system. For other operating systems, you are probably better off following the steps described in this document manually.

This script has a configuration section at its beginning; you can change the parameters to better fit your environment:

  • DVCSTORE_DIR - directory with local dvc remote
  • PYTHON - Python 3.x executable (before activating virtual environment)

This project uses the DVC (Data Version Control) tool to track and version annotation and metrics data. DVC lets you store large files and large directories outside of the Git repository while still keeping them under version control. They can be stored locally, or in the cloud.

The examples-init.bash script also configures local DVC storage (see the next subsection).

Data pipeline with DVC

To provide reproducibility, and to make it possible to version data files separately from the code, this project uses the DVC (Data Version Control) tool for its examples.

DVC pipelines are versioned using Git, and allow you to better organize projects and reproduce complete workflows and results at will. The pipeline is defined in the dvc.yaml file.

You can re-run the whole pipeline, after installing DVC, with the dvc repro command. It runs only those pipeline stages that need it, which it determines by checking whether the stage dependencies (defined in dvc.yaml) have changed. The results are saved in the DVC cache, and you can push them to a DVC remote with dvc push, if you have one configured (the examples-init.bash script from the previous subsection configures a DVC remote with storage on the local filesystem).
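
To illustrate the idea of re-running a stage only when its dependencies changed, here is a minimal sketch that compares content hashes of dependency files against previously recorded values. This is a simplified illustration and not DVC's actual implementation; the function names and the use of SHA-256 are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the hex SHA-256 digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_is_stale(deps, recorded):
    """A stage needs re-running if any dependency hash differs from the recorded one."""
    return any(file_hash(dep) != recorded.get(str(dep)) for dep in deps)


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "input.txt"
    dep.write_text("first version\n")
    recorded = {str(dep): file_hash(dep)}  # state saved after the last run

    stale_before_edit = stage_is_stale([dep], recorded)
    dep.write_text("second version\n")  # the dependency changes
    stale_after_edit = stage_is_stale([dep], recorded)

print(stale_before_edit, stale_after_edit)
```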

Downloading data from DAGsHub

DAGsHub is a platform for AI and ML developers that lets you manage and collaborate on your data, models, and experiments, alongside your code. Among other things, it can be used as a DVC remote.

Here is the fragment from .dvc/config that defines the "dagshub" DVC remote used to store data:

['remote "dagshub"']
    url = s3://dvc
    endpointurl = https://dagshub.com/ncusi/PatchScope.s3

Note that you need to install support for S3 for DVC to use this remote, see Example: installing DVC with support for Amazon S3 storage.

You can then download all data with dvc pull -r dagshub.
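
If you need to inspect the remote definition programmatically, the fragment quoted above is INI-like and can be read with Python's standard configparser module. This is only a sketch for illustration: the read_remotes helper is hypothetical, and DVC itself may parse its configuration differently.

```python
import configparser

# The fragment from .dvc/config quoted above
DVC_CONFIG = """\
['remote "dagshub"']
    url = s3://dvc
    endpointurl = https://dagshub.com/ncusi/PatchScope.s3
"""


def read_remotes(text):
    """Return {remote_name: {option: value}} for every remote section in text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    remotes = {}
    for section in parser.sections():
        name = section.strip("'")  # section headers are wrapped in single quotes
        if name.startswith('remote "') and name.endswith('"'):
            remotes[name[len('remote "'):-1]] = dict(parser[section])
    return remotes


remotes = read_remotes(DVC_CONFIG)
print(remotes["dagshub"]["endpointurl"])
```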

Alternatively, you can download data via the DAGsHub web interface, from https://dagshub.com/ncusi/PatchScope:

  • data/examples/ includes annotations and statistics for a few example repositories:
    • hellogitworld (only statistics, repository archived 2020),
    • qtile,
    • tensorflow (limited to top-2 non-bot authors),
    • linux (years 2021-2023)
  • data/experiments/ includes various pieces of data computed when comparing automatic annotations generated by PatchScope with manual annotations from BugsInPy subset of HaPy-Bugs dataset, and with manual annotations from Herbold et al. "A fine-grained data set and analysis of tangling in bug fixing commits" available in SmartSHARK.

Jupyter Notebooks

The notebooks/ directory contains Jupyter Notebooks with data exploration, data analysis, etc. See notebooks/README.md for details.

Development

Virtual environment

To avoid dependency conflicts, it is strongly recommended to create a virtual environment, for example with:

python -m venv .venv

This needs to be done only once, from the top directory of the project. For each session, you should activate the environment:

source .venv/bin/activate

Using a virtual environment, either directly as shown above or by using pipx, might be required if you cannot install system packages and your Python installation is marked as externally managed, in which case pip fails with:

error: externally-managed-environment

× This environment is externally managed
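
To check whether the interpreter you are about to install into is running inside a virtual environment, you can compare sys.prefix with sys.base_prefix; they differ when an environment created with venv is active. A minimal sketch (the function name in_virtualenv is hypothetical):

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


if in_virtualenv():
    print(f"Virtual environment active: {sys.prefix}")
else:
    print("No virtual environment active; consider running: python -m venv .venv")
```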

Installing the package in editable mode

To install the project in editable mode (from the top directory of this repo):

python -m pip install -e .

To also be able to run tests, use:

python -m pip install --editable .[dev]

Running tests

This project uses the pytest framework. Note that pytest requires Python 3.8+ or PyPy3.

To run tests, run the following command:

pytest
or
python -m pytest

Roadmap

See TODO.md.

Related projects

Here are some related projects that can also be used to extract development statistics from a project or a repository.

Command line and terminal interface tools:

  • git-quick-stats (Shell script) is a simple and efficient way to access various statistics in a git repository
  • git-stats (JavaScript) provides local git statistics, including GitHub-like contributions calendars
  • git_dash.sh (Shell script) is a command-line shell script for generating a Git metrics dashboard directly in your terminal
  • heatwave (Python) visualizes your git commits with a heat map in the terminal, similar to how GitHub's heat map looks
  • statscat (Go) is a CLI tool to get statistics of all your git repositories
  • hxtools (Perl) by Jan Engelhardt is a collection of small tools and scripts, which includes git-author-stat (commit author statistics of a git repository), git-blame-stat (per-line author statistics), and git-revert-stats (reverting statistics)
  • git-fame (in Python) and git-fame-rb (in Ruby) are command-line tools to pretty-print Git repository collaborators sorted by contributions
  • git-of-theseus (Python) is a set of scripts to analyze how a Git repo grows over time
  • GitHub Linguist (Ruby) can also be used from the command line, using the github-linguist executable, to generate a repository's language stats (the language breakdown by percentage and file size), also for a selected revision
  • git-metrics (Python) is a set of utility scripts to scrape data from git repositories to help teams improve (metrics such as lead time and open branches)

Tools to generate HTML dashboard, or providing an interactive web application:

Visualizations for a specific repository:

  • A Git history visualization page by Jeff Palmer shows "An Interactive Development History" of Git: project and contributor statistics, relative cumulative contributions by contributor, and aggregated commits by contributor by month with milestone annotations. Jeff wrote an associated blog post about how he created the visualization.
  • gitdm (the "git data miner") is the tool that Greg KH and Jonathan Corbet have used to create statistics on where kernel patches come from. Written in Python. Original at git://git.lwn.net/gitdm.git

Web applications that demonstrate some MSR tool:

  • GitHub offers GitHub Insights for repositories (see for example Contributors to qtile/qtile). This includes the following subpages:
    • Pulse (with configurable period of 1 month, 1 week, 3 days, 24 hours) shows information about pull requests and issues, and a summary of changes as text (N authors pushed X commits to master, and Y to all branches. On master, M files were changed and there have been A additions and D deletions).
    • Contributions per week to master, excluding merge commits {as smoothed (!) line/area plot}, for whole project, and for up to 100 authors (with configurable period of all, last month, last 3 months, last 6 months, last 12 months, last 24 months; with configurable type of contributions: commits, additions, deletions). For each author we also have summary of their contributions as text (N commits, A ++, D --).
    • Commits shows two plots: bar plot of commits per week over time for the last year {without any explanation, except for information shown on mouse hover}, and line plot with days of the week on x-axis {no explanation, no information on hover (!)}. No configuration.
    • Code frequency over the history of the project: additions and deletions per week (where additions use green solid lines, and deletions use red dashed lines and are plotted upside-down). No configuration.
    • other pages related to GitHub specifically, or the project as whole but not its history (like Community Standards, Dependency graph, Forks, or Action Usage Metrics).
  • GitHub also offers Developer Overview, which, among other things, includes the following chart:
    • N contributions in last year / in YYYY, showing heatmap using 5-color discrete colormap, with year worth of weeks on x-axis, and day of the week (Sun to Sat) on the y-axis. You can switch between the years with a "radio button" (though there is no 'last year' entry). Contributions are timestamped according to Coordinated Universal Time (UTC) rather than contributor's local time zone.
  • Assayo has a homepage with a demo, where you can provide the output of a given Git CLI command run in your repo to create a demo for that repo; you can also view a demo with mock data. Written in JavaScript with React.
  • Githru has an interactive demo, where you can select one of the following two GitHub repositories to visualize: vuejs/vue and realm/realm-java. Written in JavaScript with React, D3, dagre.
  • GitVision, a 3D repository graph visualization tool, has a live demo with visualization for more than 20 repositories (ranging from tiny to large), and where you can visualize your own repository by uploading the result of running the GitVision script. The demo is written in JavaScript using Vue and deployed with Vite.
  • GitBug-Java, a reproducible Java benchmark of recent bugs (tool accompanying the GitBug-Java: A Reproducible Java Benchmark of Recent Bugs paper (on arXiv)), has web app visualizing the dataset. No source code for the web app; it seems to be in JavaScript using Angular, with the help of Chart.js and diff2html.
  • Defects4J Dissection is an open-source web app that presents data to help researchers and practitioners to better understand the Defects4J bug dataset. Includes table view (the default) and charts. It is the open-science appendix of "Dissection of a bug dataset: anatomy of 395 patches from Defects4J" paper. Written in Python and JavaScript, under MIT license.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT