examples

Example data - statistics extracted from patch annotations

The stats/ subdirectory contains JSON files generated by the diff-gather-stats script and its various subcommands. These subcommands process the output of the diff-annotate from-repo command, which is stored in the annotations/ subdirectory. The files were generated by running the DVC pipeline (defined in the /dvc.yaml file) with the dvc repro command.

Using a DVC pipeline makes it possible to regenerate only those files that are out of date, by re-running just the stages whose dependencies have changed.

Here is the graph of the DVC pipeline stages, as a Mermaid flowchart:

flowchart LR
        node1["annotate@{0.a,0.b,1.c}"]
        node3["clone@{0,1}"]
        node4["purpose-counter@{0,1}"]
        node5["purpose-per-file@{0,1}"]
        node6["lines-stats@{0,1}"]
        node7["timeline@{0,1}"]
        node8["timeline.purpose-to-type@{0,1}"]
        node1-->node4
        node1-->node5
        node1-->node6
        node1-->node7
        node1-->node8
        node3-->node1
Variables in the DAG of DVC stages above:

  • 0: tensorflow repository
  • 0.a: ezhulenev@google.com author in tensorflow repository
  • 0.b: yong.tang.github@outlook.com author in tensorflow repository
  • 1: qtile repository
  • 1.c: all authors in qtile repository, no merge commits
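
The compact node labels in the flowchart can be read as one stage per listed variable. The following is a minimal Python sketch of that reading; the helper name expand_stage_label is hypothetical, and the assumption that each label abbreviates stages named `<stage>@<variable>` is an interpretation of the graph, not something taken from /dvc.yaml.

```python
import re


def expand_stage_label(label: str) -> list[str]:
    """Expand a compact label such as 'clone@{0,1}' into one name per variable.

    Assumption: the braces hold a comma-separated list of variables, as in
    the flowchart above. Labels without such a suffix are returned unchanged.
    """
    match = re.fullmatch(r"(?P<stage>[^@]+)@\{(?P<items>[^{}]*)\}", label)
    if match is None:
        return [label]
    items = [item.strip() for item in match.group("items").split(",")]
    return [f"{match.group('stage')}@{item}" for item in items if item]


for name in expand_stage_label("annotate@{0.a,0.b,1.c}"):
    print(name)
# prints:
# annotate@0.a
# annotate@0.b
# annotate@1.c
```

Under this reading, annotate@0.a would be the stage that annotates the tensorflow commits authored by ezhulenev@google.com.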

You can also see the whole, up-to-date interactive graph of stages and their dependencies at https://dagshub.com/ncusi/PatchScope#repo-graph-view.

The JSON files in stats/ are analyzed by the Jupyter notebooks in the /notebooks/ directory; see /notebooks/README.md.

Projects and repositories

The list of example repositories taken into consideration is borrowed from the GitVision app demo site.

  • Large repositories:
  • TensorFlow: A comprehensive machine learning library by Google
    This repo provides a great example of a large, complex open-source project with a very active community.
  • ...

Other repositories were selected by authors of this project:

  • Qtile: A full-featured, hackable tiling window manager written and configured in Python
    This repo is a medium-sized, but quite active project.

Repositories are cloned into ~/example_repositories/. On the author's workstation, this directory is a symbolic link to the /mnt/data/python-diff-annotator/example_repositories/ directory.

Cloning can be done by running the "clone" stage of the DVC pipeline.

NOTE: all commands are assumed to be run from the top directory of the project, not from its examples/stats/ subdirectory.

Generating annotation data (for 'tensorflow')

The annotation data for further processing is generated directly from the repo in the "flat" format. It was generated with the following command:

diff-annotate from-repo \
  --output-dir=data/examples/annotations/tensorflow/ezhulenev/ \
  ~/example_repositories/tensorflow/ \
  --author=ezhulenev@google.com
and
diff-annotate from-repo \
  --output-dir=data/examples/annotations/tensorflow/yong.tang/ \
  ~/example_repositories/tensorflow/ \
  --author=yong.tang.github@outlook.com
both using the "annotate" stage of the DVC pipeline.

The "flat" format has the following structure: <output_dir>/<commit_id>.json.
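
Because each commit gets its own file in the "flat" format, the set of annotated commits can be recovered from the file names alone. Here is a minimal Python sketch of that idea; the function names list_annotated_commits and load_annotation are hypothetical, and nothing is assumed about the contents of each JSON file beyond it being valid JSON.

```python
import json
from pathlib import Path


def list_annotated_commits(output_dir):
    """Return the commit ids found as '<commit_id>.json' directly in output_dir."""
    return sorted(path.stem for path in Path(output_dir).glob("*.json"))


def load_annotation(output_dir, commit_id):
    """Load the annotation data for one commit as parsed JSON."""
    path = Path(output_dir) / f"{commit_id}.json"
    with path.open(encoding="utf-8") as json_file:
        return json.load(json_file)
```

For example, list_annotated_commits("data/examples/annotations/tensorflow/ezhulenev/") would list the commits annotated by the first command above, provided that command has already been run.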

Generating stats data (for 'tensorflow')

Statistics computed from the annotations were saved in JSON files, one file per type of statistics.

  • tensorflow.purpose-per-file.json (1.8 MB) was generated with the following command,
    using the "purpose-per-file" stage of the DVC pipeline:

    diff-gather-stats --annotations-dir='' \
      purpose-per-file \
      data/examples/stats/tensorflow.purpose-per-file.json \
      data/examples/annotations/tensorflow/
    
  • tensorflow.lines-stats.json (9.8 MB) was generated with the following command,
    using the "lines-stats" stage of the DVC pipeline:

    diff-gather-stats --annotations-dir='' \
      lines-stats \
      data/examples/stats/tensorflow.lines-stats.json \
      data/examples/annotations/tensorflow/
    
  • tensorflow.timeline.json (3.2 MB) was generated with the following command,
    using the "timeline" stage of the DVC pipeline:

    diff-gather-stats --annotations-dir='' \
      timeline \
      data/examples/stats/tensorflow.timeline.json \
      data/examples/annotations/tensorflow/
    
  • tensorflow.timeline.purpose-to-type.json (3.2 MB) was generated with the following command,
    using the "timeline.purpose-to-type" stage of the DVC pipeline:

    diff-gather-stats --annotations-dir='' \
      timeline \
      --purpose-to-annotation=data \
      --purpose-to-annotation=documentation \
      --purpose-to-annotation=markup \
      --purpose-to-annotation=other \
      --purpose-to-annotation=project \
      --purpose-to-annotation=test \
      data/examples/stats/tensorflow.timeline.purpose-to-type.json \
      data/examples/annotations/tensorflow/
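
The four commands above share one pattern: a subcommand, optional extra options, an output file named `<repo>.<stats name>.json`, and the annotations directory. The following Python sketch only builds and prints those command lines; it does not run them. The helper gather_stats_command and the two constants are hypothetical names introduced for illustration, while the options and paths are copied from the commands above.

```python
import shlex

STATS_DIR = "data/examples/stats"              # as used in the commands above
ANNOTATIONS_DIR = "data/examples/annotations"  # as used in the commands above


def gather_stats_command(repo, subcommand, stats_name=None, extra_options=()):
    """Build the argument list for one diff-gather-stats invocation.

    stats_name defaults to the subcommand name; it differs only for the
    'timeline.purpose-to-type' file, which uses the 'timeline' subcommand.
    """
    stats_name = stats_name or subcommand
    return [
        "diff-gather-stats",
        "--annotations-dir=",  # same as --annotations-dir='' in the shell
        subcommand,
        *extra_options,
        f"{STATS_DIR}/{repo}.{stats_name}.json",
        f"{ANNOTATIONS_DIR}/{repo}/",
    ]


purposes = ("data", "documentation", "markup", "other", "project", "test")
commands = [
    gather_stats_command("tensorflow", "purpose-per-file"),
    gather_stats_command("tensorflow", "lines-stats"),
    gather_stats_command("tensorflow", "timeline"),
    gather_stats_command(
        "tensorflow",
        "timeline",
        stats_name="timeline.purpose-to-type",
        extra_options=[f"--purpose-to-annotation={p}" for p in purposes],
    ),
]

for command in commands:
    print(shlex.join(command))
```

In the project itself these invocations are driven by the DVC pipeline rather than by a script like this; the sketch is only meant to make the naming convention of the output files explicit.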