Example data - statistics extracted from patch annotations
The stats/ subdirectory contains JSON files generated with the diff-gather-stats script and its various subcommands. These process the output of the diff-annotate from-repo command, which is found in the annotations/ subdirectory.
Those files were generated by running the DVC pipeline (which is defined in the /dvc.yaml file) with the dvc repro command.
Using a DVC pipeline makes it possible to re-run only the stages whose inputs have changed, and so to regenerate only the files that are out of date.
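The idea behind this selective re-run can be illustrated with a small Python sketch. It is not DVC's implementation: the function names and the choice of SHA-256 are assumptions made for illustration, and a real pipeline tool also considers things such as missing outputs or a changed command. Here a stage is treated as stale when the content hash of any dependency differs from the hash recorded on the previous run.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return a hex digest of the file contents (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_digests(deps):
    """Snapshot the current digests of all dependencies, keyed by path string."""
    return {str(dep): file_digest(dep) for dep in deps}


def stage_is_stale(deps, recorded):
    """A stage needs re-running if any dependency is new or has changed content."""
    return any(recorded.get(str(dep)) != file_digest(dep) for dep in deps)
```

After a successful run one would store the result of record_digests(); on the next run, only stages for which stage_is_stale() returns True need to be executed again.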
Here is the graph of the DVC pipeline stages, as a Mermaid flowchart:
flowchart LR
node1["annotate@{0.a,0.b,1.c}"]
node3["clone@{0,1}"]
node4["purpose-counter@{0,1}"]
node5["purpose-per-file@{0,1}"]
node6["lines-stats@{0,1}"]
node7["timeline@{0,1}"]
node8["timeline.purpose-to-type@{0,1}"]
node1-->node4
node1-->node5
node1-->node6
node1-->node7
node1-->node8
node3-->node1
Variables in the DAG of DVC stages above:
- 0: tensorflow repository
- 0.a: ezhulenev@google.com author in tensorflow repository
- 0.b: yong.tang.github@outlook.com author in tensorflow repository
- 1: qtile repository
- 1.c: all authors in qtile repository, no merge commits
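The node labels in the flowchart use a compact brace notation. The following Python sketch expands such a label into individual stage names, assuming the braces are shorthand for one stage per listed key and that each stage is named `<stage>@<key>`. The helper name expand_stage_label is hypothetical and not part of this project.

```python
import re


def expand_stage_label(label):
    """Expand 'name@{a,b,c}' into ['name@a', 'name@b', 'name@c'].

    Labels that do not use the brace notation are returned unchanged.
    """
    match = re.fullmatch(r"(?P<name>[^@]+)@\{(?P<keys>[^{}]*)\}", label)
    if match is None:
        return [label]
    keys = [key.strip() for key in match.group("keys").split(",") if key.strip()]
    return [f"{match.group('name')}@{key}" for key in keys]


print(expand_stage_label("annotate@{0.a,0.b,1.c}"))
# → ['annotate@0.a', 'annotate@0.b', 'annotate@1.c']
```

Read together with the variable list above, annotate@0.a would then be the annotation stage for the ezhulenev@google.com author in the tensorflow repository.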
You can also see the whole, up-to-date interactive graph of stages and their dependencies at https://dagshub.com/ncusi/PatchScope#repo-graph-view.
Those files are analyzed by the Jupyter notebooks in the /notebooks/ directory; see /notebooks/README.md.
Projects and repositories
The list of example repositories taken into consideration is borrowed from the GitVision app demo site.
- Large repositories:
  - TensorFlow: A comprehensive machine learning library by Google
    This repo provides a great example of a large, complex open-source project with a very active community.
  - ...
Other repositories were selected by authors of this project:
- Qtile: A full-featured, hackable tiling window manager written and configured in Python
  This repo is a medium-sized but quite active project.
Repositories are cloned into ~/example_repositories/. On the author's workstation this directory is a symbolic link to the /mnt/data/python-diff-annotator/example_repositories/ directory.
Cloning can be done by running the "clone" stage of the DVC pipeline.
NOTE: all commands are assumed to be run from the top directory of the project, not from its examples/stats/ subdirectory.
Generating annotation data (for 'tensorflow')
The annotation data for further processing is generated directly from the repo in the "flat" format. It was generated with the following commands, one per author:
diff-annotate from-repo \
    --output-dir=data/examples/annotations/tensorflow/ezhulenev/ \
    ~/example_repositories/tensorflow/ \
    --author=ezhulenev@google.com

diff-annotate from-repo \
    --output-dir=data/examples/annotations/tensorflow/yong.tang/ \
    ~/example_repositories/tensorflow/ \
    --author=yong.tang.github@outlook.com
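The two invocations differ only in the author and the output subdirectory, so they can be built programmatically. The Python sketch below only constructs the argument lists and uses just the options shown in the commands above; the helper name build_annotate_command and the AUTHORS mapping are hypothetical and not part of the project.

```python
import os


def build_annotate_command(repo_path, author, output_dir):
    """Return the argv list for one 'diff-annotate from-repo' run (nothing is executed)."""
    return [
        "diff-annotate", "from-repo",
        f"--output-dir={output_dir}",
        os.path.expanduser(repo_path),  # expand a leading '~', as the shell would
        f"--author={author}",
    ]


# Hypothetical mapping: output subdirectory -> author e-mail, taken from the commands above.
AUTHORS = {
    "ezhulenev": "ezhulenev@google.com",
    "yong.tang": "yong.tang.github@outlook.com",
}

commands = [
    build_annotate_command(
        "~/example_repositories/tensorflow/",
        email,
        f"data/examples/annotations/tensorflow/{subdir}/",
    )
    for subdir, email in AUTHORS.items()
]
```

Each list in commands could then be passed to subprocess.run(cmd, check=True) from the top directory of the project, assuming diff-annotate is installed and on the PATH.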
The "flat" format has the following structure: <output_dir>/<commit_id>.json.
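Given that layout, the annotation files for one output directory can be enumerated with a few lines of Python. This is a minimal sketch that relies only on the <output_dir>/<commit_id>.json naming described above; the function name list_flat_annotations is hypothetical, and the contents of the files are not inspected.

```python
from pathlib import Path


def list_flat_annotations(output_dir):
    """Map commit id -> path for every '<commit_id>.json' directly under output_dir."""
    return {
        path.stem: path  # the file name without '.json' is taken as the commit id
        for path in sorted(Path(output_dir).glob("*.json"))
    }
```

For example, len(list_flat_annotations("data/examples/annotations/tensorflow/ezhulenev/")) would give the number of annotated commits stored for that author.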
Generating stats data (for 'tensorflow')
Statistics computed from annotations were saved in JSON files, one file per type of statistics.
- tensorflow.purpose-per-file.json (1.8 MB) was generated with the following command, using the "purpose-per-file" stage of the DVC pipeline:
  diff-gather-stats --annotations-dir='' \
      purpose-per-file \
      data/examples/stats/tensorflow.purpose-per-file.json \
      data/examples/annotations/tensorflow/
- tensorflow.lines-stats.json (9.8 MB) was generated with the following command, using the "lines-stats" stage of the DVC pipeline:
  diff-gather-stats --annotations-dir='' \
      lines-stats \
      data/examples/stats/tensorflow.lines-stats.json \
      data/examples/annotations/tensorflow/
- tensorflow.timeline.json (3.2 MB) was generated with the following command, using the "timeline" stage of the DVC pipeline:
  diff-gather-stats --annotations-dir='' \
      timeline \
      data/examples/stats/tensorflow.timeline.json \
      data/examples/annotations/tensorflow/
- tensorflow.timeline.purpose-to-type.json (3.2 MB) was generated with the following command, using the "timeline.purpose-to-type" stage of the DVC pipeline:
  diff-gather-stats --annotations-dir='' \
      timeline \
      --purpose-to-annotation=data \
      --purpose-to-annotation=documentation \
      --purpose-to-annotation=markup \
      --purpose-to-annotation=other \
      --purpose-to-annotation=project \
      --purpose-to-annotation=test \
      data/examples/stats/tensorflow.timeline.purpose-to-type.json \
      data/examples/annotations/tensorflow/
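A quick way to check a generated statistics file is to load it and report its size and top-level shape. The Python sketch below makes no assumption about the schema of these files beyond their being valid JSON; the function name summarize_stats_file is hypothetical.

```python
import json
from pathlib import Path


def summarize_stats_file(path):
    """Load a JSON file and describe it without assuming any particular schema."""
    path = Path(path)
    with path.open(encoding="utf-8") as fp:
        data = json.load(fp)
    return {
        "name": path.name,
        "size_mb": round(path.stat().st_size / 1_000_000, 1),
        "top_level_type": type(data).__name__,
        # Only containers have a meaningful number of top-level entries.
        "top_level_entries": len(data) if isinstance(data, (dict, list)) else None,
    }
```

Calling summarize_stats_file("data/examples/stats/tensorflow.timeline.json") from the top directory of the project would return a small dictionary that can be compared against the file sizes listed above.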