Annotation process¶
Installing the patchscope package installs scripts (currently three) that you can run to generate patches, annotate them, and extract their statistics.
Every script name starts with the diff-* prefix.
Each script and subcommand (for those scripts that have multiple subcommands) supports the --help option.
- diff-generate: used to generate patches (*.patch and *.diff files) from a given repository, in a format suitable for later analysis;
- diff-annotate: annotates an existing dataset (patch files in subdirectories), or annotates a selected subset of commits (of changes in commits) in the given Git repository, producing annotation data (*.json files);
- diff-gather-stats: computes various statistics and metrics from patch annotation data generated by the diff-annotate script, usually producing a single *.json file.
Patches source¶
There are two possible sources of patches (unified diffs) to annotate: existing patches on disk, or a Git repository.
Patches and patch datasets¶
The first possible data source is a patch, or a series of patches (a dataset), saved as *.diff files on disk. This can be some pre-existing dataset, like BugsInPy [1], one with patches stored as files on disk (instead of being stored in some database). It can also be the result of running the diff-generate script.
Two of the diff-annotate subcommands support this type of diff data source:
- diff-annotate patch [OPTIONS] PATCH_FILE RESULT_JSON: annotate a single PATCH_FILE, writing results to RESULT_JSON,
- diff-annotate dataset [OPTIONS] DATASETS...: annotate all bugs in the provided DATASETS.
The patch subcommand of the diff-annotate command (script) is mainly intended as a helper in examining and debugging the annotation process and the annotation format.
The dataset subcommand is, instead, meant to annotate an existing dataset of patches, that is, a set of patches in a specific directory or directories.
You can annotate more than one dataset (directory) at once, though they need to have the same internal structure. By default, each dataset is expected to be an existing directory with the following path structure:
/ /patches/ .diff
This directory structure follows the structure used by the BugsInPy [1] dataset. You can change the /patches/ part with the --patches-dir option, or eliminate it altogether.
For example, with --patches-dir='', the diff-annotate script would expect data to have the following structure:
/ / .diff
Each dataset can consist of one or more bugs; each bug should include at least one *.diff file to annotate.
By default, annotation data is saved beside the patches, in the same directory structure, as JSON files, one file per patch / diff:
/ /annotation/ .json
You can change the /annotation/ part with the --annotations-dir option.
You can also make diff-annotate save annotation data in a separate place provided with the --output-prefix option (but with the same directory structure as mentioned above).
Git repository¶
Another source of patches to annotate is selected commits from a local Git repository.
One of the diff-annotate subcommands supports this type of diff data source:
- diff-annotate from-repo [OPTIONS] REPO_PATH [REVISION_RANGE...]: create annotation data for commits from a local Git repository (with REVISION_RANGE... passed as arguments to the git log command).
There is one required option: --output-dir, which you need to provide for diff-annotate from-repo to know where to store the annotation data.
By default, the output JSON files are stored as:
/ .json
You can, if you want, create a layout like the one in BugsInPy [1] (for example, when reproducing whole diffs instead of using the simplified diffs provided as *.diff files by the BugsInPy dataset). If you expect many commits, you can use the --use-fanout flag to limit the number of files stored in a single directory.
You can select which commits you want to annotate by providing an appropriate <revision-range> argument, which is passed to git log. If not provided, it defaults to HEAD (that is, the whole history of the current branch).
For a complete list of ways to spell <revision-range>, see the "Specifying Ranges" section of the gitrevisions(7) manpage.
Here are a few git log options that are often used with diff-annotate from-repo:
- --max-parents=1, or --no-merges, is probably the most commonly used option; it limits commits to those with a maximum of 1 parent, dropping merge commits (which, as of PatchScope 0.4.1, use --diff-merges=first-parent to generate a unified diff out of a merge, comparing it against the first parent, and often generating a very large diff encompassing all changes made on the merged-in branch),
- --min-parents=1 can be used to drop root commits (rarely used),
- --author=<email> or --author=<name> can be used to limit to commits authored (created) by a given author,
- <tag1>..<tag2> revision range specifier (like e.g. v6.8..v6.9) to select all changes between two tagged revisions, perhaps together with the --ancestry-path option,
- date range with --after=<date> and/or --before=<date> (like e.g. --after=2021.01.01 --before=2023.12.31) to select all changes available from the default / current / selected branch that were made within the given date range (note: git log stops at the first commit which is older than the specified date, unless you use --since-as-filter=<date> instead of --since or --after).
Reproducing (redoing) an existing bug dataset, provided that we know the ids of the bug-fixing commits, can currently (as of 0.4.1) be done with the following git log option:
- --no-walk <commit1> <commit2>... to only process the selected commits, without traversing their ancestors.
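As an illustration, the options above can be combined into a single invocation. The sketch below only assembles the command line as a list of arguments and prints it (it does not run anything). The helper name from_repo_command and both paths are hypothetical, and placing the git log options after REPO_PATH follows the synopsis shown earlier rather than verified behavior.

```python
import shlex


def from_repo_command(repo_path, output_dir, *git_log_args):
    """Assemble a `diff-annotate from-repo` command line (hypothetical helper)."""
    return [
        "diff-annotate", "from-repo",
        f"--output-dir={output_dir}",  # the one required option
        repo_path,
        *git_log_args,                 # meant to be passed on to `git log`
    ]


cmd = from_repo_command(
    "path/to/repo", "annotations/",   # placeholder paths
    "--no-merges", "v6.8..v6.9",
)
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run() once the paths point at a real repository and output directory.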
File language detection¶
The next step in the annotation process is detecting the language and purpose of each changed file in a diff. This is done with the help of GitHub Linguist, which is the library used on GitHub.com to detect blob languages, ignore binary or vendored files, suppress generated files in diffs, and generate language breakdown graphs.
To be more exact, language detection and file purpose detection are done by custom code that uses the languages.yml file from GitHub Linguist (or rather its local copy), and some custom rules.
This code is more limited than GitHub Linguist, as it (currently, as of 0.4.1) uses only the pathname of the file (extension, basename, directory it is in), and it does not examine its contents (which, in the case of patches and datasets, might not be available; in the case of repositories, they would have to be retrieved from a specific commit in the repository).
On the other hand, PatchScope's custom rules provide more types of file purpose than GitHub Linguist's data, programming, markup, prose, or nil. Those new purposes include project, documentation, test, or unknown.
The purpose of a changed file can be, and is, later used to annotate each changed line in that file with the type of the changed line.
The result of file language detection is the following mapping (dict):
{
    "language": language,  # e.g. "Markdown"
    "type": filetype,  # e.g. "prose"
    "purpose": file_purpose  # e.g. "documentation"
}
Using PyLinguist (optional)¶
GitHub Linguist is a Ruby library, which makes it difficult to integrate with the Python-based PatchScope. There is, however, a Python clone of github/linguist. The version available from PyPI is not the most up to date; you can instead install the retanoj/linguist fork of the original douban/linguist Python clone (itself a fork of liluo/linguist) with
pip install patchscope[pylinguist]
pip install linguist@git+https://github.com/retanoj/linguist#egg=master
As of PatchScope version 0.4.1, using PyLinguist, which you can turn on with the --use-pylinguist global option of the diff-annotate script, still only uses the pathname of the changed file.
Note that PyLinguist does not keep its languages.yml up to date, which is why by default diff-annotate will make it use its own copy of the file. You can turn this off with the --no-update-languages flag.
All language detection code is in the diffannotator.languages module (which is created from the src/diffannotator/languages.py file).
Built-in custom rules¶
The custom rules for language and purpose detection of changed files are contained in 3 global variables and 2 functions.
Both FILENAME_TO_LANGUAGES and EXT_TO_LANGUAGES are used to augment and override what GitHub Linguist's languages.yml detects as the language of the changed file, or to provide a language when languages.yml does not detect it.
For example, languages.yml detects README.me, read.me, readme.1st as "Text" files with purpose (called 'type' by GitHub Linguist) of "prose", but not the plain README file. The FILENAME_TO_LANGUAGES variable handles this case.
On the other hand, the ".md" extension is assigned by languages.yml to both "Markdown" (type: prose), and "GCC Machine Description" (type: programming). The contents of the EXT_TO_LANGUAGES variable are used to break this tie in favor of "Markdown".
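The override and tie-breaking described above can be sketched as follows. The contents of all four dictionaries are illustrative stand-ins (the real tables live in diffannotator.languages and in languages.yml), detect_language() is a hypothetical helper rather than PatchScope's API, and the lookup order (overrides first, then languages.yml data) is an assumption.

```python
from pathlib import PurePath

# Illustrative excerpt of what languages.yml might yield (assumed data)
YML_BY_FILENAME = {"read.me": ["Text"]}
YML_BY_EXT = {".md": ["GCC Machine Description", "Markdown"], ".py": ["Python"]}

# Illustrative overrides, named after the variables mentioned in the text
FILENAME_TO_LANGUAGES = {"README": ["Text"]}
EXT_TO_LANGUAGES = {".md": ["Markdown"]}


def detect_language(path):
    """Return a language name based only on the pathname, or None."""
    p = PurePath(path)
    lookups = (
        (FILENAME_TO_LANGUAGES, p.name),   # custom filename rules
        (EXT_TO_LANGUAGES, p.suffix),      # custom extension rules
        (YML_BY_FILENAME, p.name),         # languages.yml filenames
        (YML_BY_EXT, p.suffix),            # languages.yml extensions
    )
    for table, key in lookups:
        if key in table:
            return table[key][0]
    return None


print(detect_language("docs/index.md"))  # tie resolved by the override
print(detect_language("README"))         # filled in by the filename rule
```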
There is yet another source of custom rules for finding file language, namely the languages_exceptions() function that takes the file path of the changed file in the repository, and the file language determined so far, and determines the file language (same as determined so far, or changed).
Besides the language name, languages.yml also provides the type field, which the diff-annotate script presents as file purpose. Here, the PATTERN_TO_PURPOSE variable can be used to augment or override the data from languages.yml.
For example, it defines patterns for "project" files (like *.cmake or requirements.txt), something that is missing from the list of possible file types in languages.yml.
Note that as of PatchScope version 0.4.1, pattern matching (which uses shell wildcards) is done using the PurePath.match method from the Python pathlib standard library, with its limitations. Currently the recursive wildcard “**” acts like non-recursive “*”. This may change in the future.
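The behavior of PurePath.match can be checked directly. The sketch below uses a two-entry dictionary as a stand-in for PATTERN_TO_PURPOSE (the entries are taken from the examples in the text, the real variable has more) and a hypothetical helper, purpose_from_patterns().

```python
from pathlib import PurePath

# Illustrative patterns only; not the full PATTERN_TO_PURPOSE
EXAMPLE_PATTERN_TO_PURPOSE = {
    "*.cmake": "project",
    "requirements.txt": "project",
}


def purpose_from_patterns(path, patterns=EXAMPLE_PATTERN_TO_PURPOSE):
    """Return the purpose of the first matching pattern, or None."""
    for pattern, purpose in patterns.items():
        # Relative patterns are matched from the right-hand side of the path
        if PurePath(path).match(pattern):
            return purpose
    return None


print(purpose_from_patterns("cmake/FindFoo.cmake"))    # project
print(purpose_from_patterns("docs/requirements.txt"))  # project

# "**" matches exactly one path component here, just like "*"
print(PurePath("a/b/test.py").match("a/**/test.py"))    # True
print(PurePath("a/b/c/test.py").match("a/**/test.py"))  # False
```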
There is yet another source of custom rules for finding file purpose, namely the _path2purpose() static method in the Languages class in the diffannotator.languages module. It is this method that finds "test" files, and which translates GitHub Linguist's "prose" file type to PatchScope's "documentation" file purpose.
Configuration from command line¶
Both finding file language and finding file purpose are configurable from the command line.
- --ext-to-language defines a mapping from extension to file language. Examples of use include --ext-to-language=.rs:Rust and --ext-to-language=".S:Unix Assembly", or --ext-to-language=.cgi:Perl (the last one is a project-specific rule).
- --filename-to-language defines a mapping from filename to file language. Examples of use include --filename-to-language=changelog:Text, --filename-to-language=config.mak.in:Makefile, etc.
- --pattern-to-purpose defines a mapping from filename pattern to file purpose. Examples of use include --pattern-to-purpose=Makefile.*:project, --pattern-to-purpose=*.dts:data, and --pattern-to-purpose=*.dts?:data.
In all those options, an empty value resets the mapping.
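One way to read these options is as a sequence of KEY:VALUE updates applied on top of the built-in mapping, with an empty value clearing everything collected so far. The sketch below follows that reading. The helper apply_mapping_options() is hypothetical, and how PatchScope actually parses and stores these values is not stated here.

```python
def apply_mapping_options(defaults, option_values):
    """Apply repeated KEY:VALUE option values to a copy of `defaults`."""
    mapping = dict(defaults)
    for value in option_values:
        if value == "":
            mapping.clear()  # an empty value resets the mapping
            continue
        key, sep, target = value.partition(":")
        if not sep or not key or not target:
            raise ValueError(f"expected KEY:VALUE, got {value!r}")
        mapping[key] = target
    return mapping


builtin = {".md": "Markdown"}  # illustrative default
print(apply_mapping_options(builtin, [".rs:Rust", ".S:Unix Assembly"]))
print(apply_mapping_options(builtin, ["", ".cgi:Perl"]))
```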
Tokenizing changes¶
The next step involves using a lexer or a parser for changes or for a whole changed file. Currently (as of version 0.4.1) the only supported lexer is the one from Pygments, a syntax highlighting library written in Python.
The code that runs the lexer can be found in the diffannotator.lexer module, which is created from the src/diffannotator/lexer.py source code file. This code is responsible for selecting and caching lexers, handling errors, etc.
The lexing process uses the .get_tokens_unprocessed(text) method from the Lexer class because it provides, as one of the values, the starting position of the token within the input text (index); it returns an iterable of (<index>, <tokentype>, <value>) tuples. This is required to be able to split multiline tokens in such a way that we have correct tokenization of each changed line. Per-line tokenization is in turn needed to determine the type (kind) of the line.
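To make the splitting step concrete, here is a minimal sketch of grouping (index, tokentype, value) tuples by line. It is not the code from diffannotator.lexer. The function name split_tokens_by_line is hypothetical, token types are plain strings standing in for Pygments token types, and lines are assumed to end with "\n".

```python
from collections import defaultdict


def split_tokens_by_line(tokens):
    """Group (index, tokentype, value) tuples by 0-based line number.

    A token whose value spans several lines is cut after each newline,
    and every piece keeps the token type and gets its own start index.
    """
    per_line = defaultdict(list)
    line_no = 0
    for index, tokentype, value in tokens:
        offset = 0
        for piece in value.splitlines(keepends=True):
            per_line[line_no].append((index + offset, tokentype, piece))
            offset += len(piece)
            if piece.endswith("\n"):
                line_no += 1
    return dict(per_line)


# Hand-written tokens for the text '# a\n"""x\ny"""\n'
tokens = [
    (0, "Comment.Single", "# a\n"),
    (4, "Literal.String.Doc", '"""x\ny"""'),  # multi-line token
    (13, "Text.Whitespace", "\n"),
]
for line, parts in split_tokens_by_line(tokens).items():
    print(line, parts)
```

With Pygments installed, the same function could be fed the output of a lexer's get_tokens_unprocessed(text) call instead of the hand-written list.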
Lexer selection¶
The Pygments lexer is selected using the pygments.lexers.get_lexer_for_filename(_fn, code=None, **options) function (passing only the filename, as of version 0.4.1).
If no lexer is found, then diff-annotate uses the Text lexer (TextLexer) as a fallback (as of version 0.4.1).
Lexers are cached under the file extension (suffix).
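The selection, fallback, and caching described above could look roughly like the sketch below. The LexerCache class is hypothetical and deliberately independent of Pygments: the lookup function and the fallback lexer are passed in, so the demonstration uses simple stand-ins. It relies on the fact that the ClassNotFound exception raised by Pygments is a subclass of ValueError.

```python
from pathlib import PurePath


class LexerCache:
    """Cache lexers per file extension, with a fallback (illustrative sketch)."""

    def __init__(self, find_lexer, fallback):
        self._find_lexer = find_lexer  # e.g. pygments.lexers.get_lexer_for_filename
        self._fallback = fallback      # e.g. an instance of pygments.lexers.TextLexer
        self._cache = {}

    def get(self, filename):
        suffix = PurePath(filename).suffix
        if suffix not in self._cache:
            try:
                self._cache[suffix] = self._find_lexer(filename)
            except ValueError:
                # Pygments' ClassNotFound derives from ValueError
                self._cache[suffix] = self._fallback
        return self._cache[suffix]


def demo_find_lexer(filename):
    """Stand-in for a real lexer lookup; knows only about *.py files."""
    if filename.endswith(".py"):
        return "python-lexer"
    raise ValueError(f"no lexer for {filename}")


cache = LexerCache(demo_find_lexer, fallback="text-lexer")
print(cache.get("pkg/module.py"), cache.get("notes.xyz"))
```

One consequence of keying on the suffix in this sketch is that all files without an extension share a single cache entry.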
Input for lexer¶
If the patches (unified diffs) come from files on disk, then what is passed to the lexer is the pre-image hunk or the post-image hunk, reconstructed from context lines and deleted lines, or from context lines and added lines, respectively.
For example, given the following patch
diff --git a/tqdm/contrib/__init__.py b/tqdm/contrib/__init__.py
index 1dddacf..935ab63 100644
--- a/tqdm/contrib/__init__.py
+++ b/tqdm/contrib/__init__.py
@@ -38,7 +38,7 @@ def tenumerate(iterable, start=0, total=None, tqdm_class=tqdm_auto,
if isinstance(iterable, np.ndarray):
return tqdm_class(np.ndenumerate(iterable),
total=total or len(iterable), **tqdm_kwargs)
- return enumerate(tqdm_class(iterable, start, **tqdm_kwargs))
+ return enumerate(tqdm_class(iterable, **tqdm_kwargs), start)
def _tzip(iter1, *iter2plus, **tqdm_kwargs):
if isinstance(iterable, np.ndarray):
return tqdm_class(np.ndenumerate(iterable),
total=total or len(iterable), **tqdm_kwargs)
return enumerate(tqdm_class(iterable, start, **tqdm_kwargs))
def _tzip(iter1, *iter2plus, **tqdm_kwargs):
return enumerate(tqdm_class(iterable, start, **tqdm_kwargs))
If this change (this diff) was extracted from the tqdm project repository, then instead of passing the pre-image or post-image hunk of changes to the lexer, we can pass the whole pre-image or post-image file contents.
In this case, the change came from the commit c0dcf39 in the tqdm repository. The pre-image file contents would be 8cc777fe:tqdm/contrib/__init__.py, and the post-image contents would be c0dcf39:tqdm/contrib/__init__.py.
The diff-annotate patch ... and diff-annotate dataset ... subcommands pass the pre-image and post-image hunks to the lexer, while diff-annotate from-repo ... passes the whole pre-image and post-image file contents. In the latter case, you can turn off this feature (e.g. to achieve better performance) with the --no-use-repo option.
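For patches on disk, the reconstruction of the pre-image and post-image hunks from the lines of a unified diff hunk can be sketched as below. The function split_hunk() is a hypothetical helper, not PatchScope's implementation. The sample lines are based on the tqdm example, with indentation filled in as an assumption.

```python
def split_hunk(hunk_lines):
    """Rebuild pre-image and post-image text from the body of one diff hunk.

    Context lines (leading space) go to both sides, deleted lines ("-")
    only to the pre-image, added lines ("+") only to the post-image.
    """
    pre_image, post_image = [], []
    for line in hunk_lines:
        marker, text = line[:1], line[1:]
        if marker in (" ", "-"):
            pre_image.append(text)
        if marker in (" ", "+"):
            post_image.append(text)
        # anything else (e.g. "\ No newline at end of file") is skipped
    return "\n".join(pre_image), "\n".join(post_image)


hunk = [
    "-    return enumerate(tqdm_class(iterable, start, **tqdm_kwargs))",
    "+    return enumerate(tqdm_class(iterable, **tqdm_kwargs), start)",
    " def _tzip(iter1, *iter2plus, **tqdm_kwargs):",
]
pre, post = split_hunk(hunk)
print(pre)
print(post)
```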
Determining line type (kind)¶
The next step is processing changed lines, one by one, and (among others) determining the line type (line kind).
Built-in rules¶
The built-in rules for determining the line type are contained in 1 variable and 3 functions.
The PURPOSE_TO_ANNOTATION global variable is (as of version 0.4.1) used to determine the type (kind) of the line in a very simple way: if the changed file pathname (before changes, or after changes, respectively) matches one of the patterns contained in PURPOSE_TO_ANNOTATION, then the whole line has this purpose as its line type. In this case we don't perform or retrieve tokenization (lexing).
Otherwise, the following rule is applied (see the code in the .process() method of the AnnotatedHunk class in the diffannotator.annotate module):
- If the line passes the line_is_comment() test, then its type is "documentation".
- Otherwise, the purpose_to_default_annotation() function is consulted, which returns "code" for files with purpose of "programming", or the file purpose as line type otherwise.
A line is declared a comment (by line_is_comment()) if the following conditions are all true:
- among line tokens there is at least one token corresponding to comments
- there are only comment tokens or whitespace tokens in that line
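The fallback rule and the comment test can be sketched as follows. Both functions are hypothetical re-statements of the rules in the text, not the code from diffannotator.annotate. Token types are plain strings standing in for Pygments token types (with real Pygments tokens the check would be `tokentype in Token.Comment`), and any token whose value is only whitespace is treated as a whitespace token, which is an assumption. The PURPOSE_TO_ANNOTATION shortcut is not covered.

```python
def is_comment_line(tokens):
    """tokens: (index, tokentype_name, value) tuples for a single line."""
    has_comment = False
    for _index, tokentype, value in tokens:
        if tokentype.startswith("Comment"):
            has_comment = True   # at least one comment token
        elif value.strip():
            return False         # a non-comment, non-whitespace token
    return has_comment


def default_line_type(file_purpose, tokens):
    """Line type according to the two-step rule described above."""
    if is_comment_line(tokens):
        return "documentation"
    return "code" if file_purpose == "programming" else file_purpose


comment_line = [
    (0, "Text.Whitespace", "    "),
    (4, "Comment.Single", "# explain"),
    (13, "Text.Whitespace", "\n"),
]
code_line = [(0, "Name", "x"), (1, "Operator", "="), (2, "Number", "1")]

print(default_line_type("programming", comment_line))  # documentation
print(default_line_type("programming", code_line))     # code
print(default_line_type("test", code_line))            # test
```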
Configuration from command line¶
You can change how changed lines are processed by providing a line callback with the --line-callback option.
The --line-callback option is modeled on the callbacks in git-filter-repo. You can use it in one of two ways.
The first possibility is to provide the body of the callback function on the command line, using a command line argument like
diff-annotate --line-callback 'BODY' ...
The BODY is then used as the body of the following function:
def line_callback(file_data, tokens):
    BODY
The BODY is expected to return the line type; the diff-annotate script checks for the existence of this return statement.
The second possibility is to provide the path name to a file with the callback function
diff-annotate --line-callback <callback_script.py>
An example of such a file can be found in data/experiments/, in the HaPy-Bug/ subdirectory, as hapybug_line_callback_func.py:
def line_callback(file_data, tokens):  # NOTE: function definition must currently be first line of the file
    # based on the code used to generate initial annotations for HaPy-Bug
    # https://github.com/ncusi/python_cve_dataset/blob/main/annotation/annotate.py#L80
    #print(f"RUNNING line_callback({file_data!r}, ...) -> {''.join([t[2] for t in tokens]).rstrip()}")
    line_type = file_data['purpose']
    if file_data['type'] != "programming":
        if file_data['purpose'] not in ["documentation", "test"]:
            line_type = "bug(fix)"
    else:
        # For programming languages
        if line_is_comment(tokens):
            line_type = "documentation"
            #print(f" line is comment, {file_data=}, {line_type=}")
        elif file_data['purpose'] == "test":
            line_type = "test"
        else:
            line_type = "bug(fix)"
    #print(f" returning {line_type=}")
    return line_type
Note: actually, the diff-annotate script processing the --line-callback parameter first checks if it can be interpreted as a file name, and if a file with the given pathname does not exist, it interprets this parameter as a function or a function body.
For the parameter contents or the file contents to be interpreted as a function definition rather than as a function body, the content must start with def on its first line (as of PatchScope version 0.4.1).
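The interpretation order described in this note can be sketched as below. The helper load_line_callback() is hypothetical and only mirrors the documented behavior (file name first, then function definition, then function body). It assumes the function is named line_callback, and its check for a return statement is a simple substring test used here for illustration.

```python
import textwrap
from pathlib import Path


def load_line_callback(param):
    """Turn a --line-callback style parameter into a callable (sketch)."""
    try:
        is_file = Path(param).is_file()
    except (OSError, ValueError):  # e.g. the "path" is really a long code string
        is_file = False
    source = Path(param).read_text() if is_file else param

    if not source.startswith("def "):
        # Treat the content as a function body and wrap it
        if "return" not in source:
            raise ValueError("callback body must contain a return statement")
        source = ("def line_callback(file_data, tokens):\n"
                  + textwrap.indent(source, "    "))

    namespace = {}
    exec(source, namespace)  # runs user-supplied code: trusted input only
    return namespace["line_callback"]


callback = load_line_callback("return file_data['purpose']")
print(callback({"purpose": "test"}, []))  # test
```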
Computing commit and diff metadata¶
TODO
Output format¶
TODO