Annotation process

Installing the patchscope package installs scripts (currently three) that you can run to generate patches, annotate them, and extract their statistics. Every script name starts with the diff-* prefix.

Each script, and each subcommand of the scripts that have subcommands, supports the --help option.

  • diff-generate: used to generate patches (*.patch and *.diff files) from a given repository, in the format suitable for later analysis;
  • diff-annotate: annotates an existing dataset (patch files in subdirectories), or a selected subset of commits (of changes in commits) in a given Git repository, producing annotation data (*.json files);
  • diff-gather-stats: computes various statistics and metrics from patch annotation data generated by the diff-annotate script, usually producing a single *.json file.

Patches source

There are two possible sources of patches (unified diffs) to annotate: existing patches on disk, or a Git repository.

Patches and patch datasets

The first possible data source is a patch, or a series of patches (a dataset), saved as *.diff files on disk. This can be some pre-existing dataset, like BugsInPy [1], one with patches stored as files on disk (instead of being stored in some database). It can also be the result of running the diff-generate script.

Two of the diff-annotate subcommands support this type of diff data source:

  • diff-annotate patch [OPTIONS] PATCH_FILE RESULT_JSON: annotate a single PATCH_FILE, writing results to RESULT_JSON,
  • diff-annotate dataset [OPTIONS] DATASETS...: annotate all bugs in the provided DATASETS.

The patch subcommand of the diff-annotate command (script) is mainly intended as a helper in examining and debugging the annotation process and the annotation format.

The dataset subcommand is, instead, meant to annotate an existing dataset of patches, that is, a set of patches in a specific directory or directories.

You can annotate more than one dataset (directory) at once, though they need to have the same internal structure. By default, each dataset is expected to be an existing directory with the following path structure:

//patches/.diff

This directory structure follows the structure used by the BugsInPy [1] dataset. You can change the /patches/ part with the --patches-dir option, or eliminate it altogether. For example, with --patches-dir='', the diff-annotate script would expect data to have the following structure:

//.diff

Each dataset can consist of one or more bugs; each bug should include at least one *.diff file to annotate.

By default, annotation data is saved beside patches, in the same directory structures, as JSON files - one file per patch / diff:

//annotation/.json

You can change the /annotation/ part with the --annotations-dir option. You can also make diff-annotate save annotation data in a separate place provided with the --output-prefix option (but with the same directory structure as mentioned above).
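The default layout can be illustrated with a small sketch that maps each patch file to the place where its annotation would be stored. This is not PatchScope's code: the function name plan_annotation_paths is hypothetical, and the sketch assumes the dataset / bug / patches / file nesting described above, with the two directory names taken from the defaults mentioned in the text.

```python
from pathlib import Path


def plan_annotation_paths(dataset_dir, patches_dir="patches",
                          annotations_dir="annotation"):
    """Map every *.diff file in a dataset to an assumed annotation *.json path."""
    dataset = Path(dataset_dir)
    plan = {}
    for bug_dir in sorted(p for p in dataset.iterdir() if p.is_dir()):
        # An empty patches_dir means the *.diff files sit directly in the bug directory.
        diff_dir = bug_dir / patches_dir if patches_dir else bug_dir
        if not diff_dir.is_dir():
            continue
        for patch in sorted(diff_dir.glob("*.diff")):
            # One JSON file per patch, stored beside the patches.
            plan[patch] = bug_dir / annotations_dir / (patch.stem + ".json")
    return plan
```

Calling plan_annotation_paths("some-dataset", patches_dir="") corresponds to the --patches-dir='' case, where patches are looked up directly in each bug directory.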

Git repository

Another source of patches to annotate is selected commits from a local Git repository.

One diff-annotate subcommand supports this type of diff data source:

  • diff-annotate from-repo [OPTIONS] REPO_PATH [REVISION_RANGE...]: create annotation data for commits from a local Git repository (with REVISION_RANGE... passed as arguments to the git log command)

There is one required option: --output-dir, which you need to provide for diff-annotate from-repo to know where to store the annotation data. By default, the output JSON files are stored as:

/.json

You can, if you want, create a layout like the one in BugsInPy [1] (for example, when reproducing whole diffs instead of using the simplified diffs provided as *.diff files by the BugsInPy dataset). If you expect many commits, you can use the --use-fanout flag to limit the number of files stored in a single directory.

You can select which commits you want to annotate by providing appropriate <revision-range> argument, which is passed to git log. If not provided, it defaults to HEAD (that is, the whole history of the current branch). For a complete list of ways to spell <revision-range>, see the "Specifying Ranges" section of the gitrevisions(7) manpage.

Here are a few git log options that are often used with diff-annotate from-repo:

  • --max-parents=1, or --no-merges, is probably the most commonly used option; it limits commits to those with at most one parent, dropping merge commits (as of PatchScope 0.4.1, merges are handled with --diff-merges=first-parent, which generates the unified diff by comparing the merge against its first parent, and often produces a very large diff encompassing all changes made on the merged-in branch),
  • --min-parents=1 can be used to drop root commits (rarely used),
  • --author=<email> or --author=<name> can be used to limit to commits authored (created) by a given author,
  • <tag1>..<tag2> revision range specifier (like e.g. v6.8..v6.9) to select all changes between two tagged revisions, perhaps together with --ancestry-path option,
  • a date range with --after=<date> and/or --before=<date> (e.g. --after=2021.01.01 --before=2023.12.31) to select all changes reachable from the default / current / selected branch that were made in that period (note: git log stops at the first commit that is older than the given date, unless you use --since-as-filter=<date> instead of --since or --after).

To reproduce (redo) an existing bug dataset, provided that you know the ids of the bug-fixing commits, you can currently (as of 0.4.1) use the following git log option:

  • --no-walk <commit1> <commit2>... to only process selected commits, but do not traverse their ancestors.
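Before running the annotation, it can be useful to preview which commits a given combination of git log options selects. The following is a minimal sketch that only wraps the standard git log command; the helper names build_git_log_command and list_commits are hypothetical and are not part of PatchScope.

```python
import subprocess


def build_git_log_command(repo_path, *log_args):
    """Return the `git log` argv that prints the ids of the selected commits."""
    return ["git", "-C", str(repo_path), "log", "--format=%H", *log_args]


def list_commits(repo_path, *log_args):
    """Run `git log` in repo_path and return the selected commit ids."""
    result = subprocess.run(
        build_git_log_command(repo_path, *log_args),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()
```

For example, list_commits("path/to/repo", "--no-merges", "v6.8..v6.9") would list the non-merge commits between the two tags, and list_commits("path/to/repo", "--no-walk", commit1, commit2) would list only the given commits.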

File language detection

The next step in the annotation process is detecting the language and purpose of each changed file in a diff. This is done with the help of GitHub Linguist, the library used on GitHub.com to detect blob languages, ignore binary or vendored files, suppress generated files in diffs, and generate language breakdown graphs.

To be more exact, language detection and file purpose detection are done by custom code that uses the languages.yml file from GitHub Linguist (or rather its local copy), and some custom rules.

This code is more limited than GitHub Linguist, as it (currently, as of 0.4.1) uses only the pathname of the file (extension, basename, directory it is in), and it does not examine its contents (which, in the case of patches and datasets might be not available; in the case of repositories, it would have to be retrieved from specific commit in the repository).

On the other hand, PatchScope's custom rules provide more types of file purpose than GitHub Linguist's data, programming, markup, prose, or nil. Those new purposes include project, documentation, test, or unknown.

The purpose of a changed file can be, and is, later used to annotate each changed line in that file with the type of the changed line.

The result of file language detection is the following mapping (dict):

{
    "language": language,    # e.g. "Markdown"
    "type": filetype,        # e.g. "prose"
    "purpose": file_purpose  # e.g. "documentation"
}
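To make the shape of this result concrete, here is a minimal pathname-only sketch that produces such a mapping. The two lookup tables and the rule for recognizing test files are hypothetical stand-ins chosen for illustration; PatchScope's real data comes from languages.yml and its own custom rules.

```python
from pathlib import PurePath

# Hypothetical, tiny stand-ins for data derived from languages.yml.
SKETCH_NAME_TABLE = {"README": ("Text", "prose")}
SKETCH_EXT_TABLE = {
    ".py": ("Python", "programming"),
    ".md": ("Markdown", "prose"),
    ".json": ("JSON", "data"),
}


def detect_file(path):
    """Guess language, type, and purpose from the pathname alone."""
    p = PurePath(path)
    # Exact file names take precedence over extensions.
    language, filetype = (SKETCH_NAME_TABLE.get(p.name)
                          or SKETCH_EXT_TABLE.get(p.suffix)
                          or (None, None))
    if "tests" in p.parts[:-1] or p.name.startswith("test_"):
        purpose = "test"  # assumed rule, for illustration only
    elif filetype == "prose":
        purpose = "documentation"
    elif filetype is None:
        purpose = "unknown"
    else:
        purpose = filetype
    return {"language": language, "type": filetype, "purpose": purpose}
```

With these tables, detect_file("docs/index.md") gives the mapping used as the example above: language "Markdown", type "prose", purpose "documentation".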

Using PyLinguist (optional)

GitHub Linguist is a Ruby library, which makes it difficult to integrate with Python-based PatchScope. There is, however, a Python clone of github/linguist. The version available from PyPI is not the most up to date; you can install the retanoj/linguist fork of the original douban/linguist Python clone (itself a fork of liluo/linguist) with

pip install patchscope[pylinguist]
or directly with
pip install linguist@git+https://github.com/retanoj/linguist#egg=master

As of PatchScope version 0.4.1, PyLinguist, which you can turn on with the --use-pylinguist global option of the diff-annotate script, still uses only the pathname of the changed file.

Note that PyLinguist does not keep its languages.yml up to date; that is why, by default, diff-annotate makes it use its own copy of the file. You can turn this off with the --no-update-languages flag.

All language detection code is in diffannotator.languages module (which is created from the src/diffannotator/languages.py file).

Built-in custom rules

The custom rules for language and purpose detection of changed files are contained in 3 global variables and 2 functions.

Both FILENAME_TO_LANGUAGES and EXT_TO_LANGUAGES are used to augment and override what GitHub Linguist's languages.yml detects as the language of the changed file, or to provide the language when languages.yml does not detect it.

For example, languages.yml detects README.me, read.me, readme.1st as "Text" files with purpose (called 'type' by GitHub Linguist) of "prose", but not the plain README file. The FILENAME_TO_LANGUAGES variable handles this case.

On the other hand, the ".md" extension is assigned by languages.yml to both "Markdown" (type: prose), and "GCC Machine Description" (type: programming). The EXT_TO_LANGUAGES variable contents is used to break this tie in favor of "Markdown".

There is yet another source of custom rules for finding the file language, namely the languages_exceptions() function, which takes the path of the changed file in the repository and the file language determined so far, and returns the file language (either the same as determined so far, or changed).

Beside language name, languages.yml also provides the type field, which diff-annotate script presents as file purpose. Here, the PATTERN_TO_PURPOSE variable can be used to augment or override the data from languages.yml.

For example, it defines patterns for "project" files (like *.cmake or requirements.txt), something that is missing from the list of possible file types in languages.yml.

Note that as of PatchScope version 0.4.1, pattern matching (which uses shell wildcards) is done using the PurePath.match method from the Python pathlib standard library, with its limitations. Currently the recursive wildcard "**" acts like the non-recursive "*". This may change in the future.
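The behaviour of PurePath.match can be checked with a short sketch. The pattern table below is a hypothetical miniature in the spirit of PATTERN_TO_PURPOSE, and purpose_from_patterns is an illustrative helper, not PatchScope's function.

```python
from pathlib import PurePath

# Hypothetical miniature pattern table, for illustration only.
SKETCH_PATTERNS = {"*.cmake": "project", "requirements.txt": "project"}


def purpose_from_patterns(path, patterns=SKETCH_PATTERNS):
    """Return the purpose of the first pattern matching path, or None."""
    for pattern, purpose in patterns.items():
        # Relative patterns are matched from the right-hand end of the path.
        if PurePath(path).match(pattern):
            return purpose
    return None


# In PurePath.match, each "**" component matches exactly one path component:
one_level = PurePath("a/b/c.py").match("a/**/c.py")     # True
two_levels = PurePath("a/b/x/c.py").match("a/**/c.py")  # False
```

Because relative patterns match from the right, "*.cmake" also matches files in subdirectories, such as cmake/FindFoo.cmake.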

There is yet another source of custom rules for finding file purpose, namely the _path2purpose() static method in Languages class in diffannotator.languages module. It is this method that finds "test" files, and which translates GitHub Linguist's "prose" file type to PatchScope's "documentation" file purpose.

Configuration from command line

Both finding the file language and finding the file purpose are configurable from the command line.

  • --ext-to-language defines mapping from extension to file language.
    Examples of use include --ext-to-language=.rs:Rust and --ext-to-language=".S:Unix Assembly", or --ext-to-language=.cgi:Perl (the last one is a project-specific rule).
  • --filename-to-language defines mapping from filename to file language.
    Examples of use include --filename-to-language=changelog:Text, --filename-to-language=config.mak.in:Makefile, etc.
  • --pattern-to-purpose defines mapping from filename pattern to that file purpose.
    Examples of use include --pattern-to-purpose=Makefile.*:project, --pattern-to-purpose=*.dts:data, and --pattern-to-purpose=*.dts?:data.

In all those options, empty value resets mapping.
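The KEY:VALUE form of these options, together with the reset on an empty value, can be sketched as follows. How diff-annotate actually parses them is not shown in this document, so the function apply_mapping_options and its error handling are assumptions made for illustration.

```python
def apply_mapping_options(defaults, values):
    """Apply KEY:VALUE option values, in order, on top of a default mapping."""
    mapping = dict(defaults)
    for value in values:
        if value == "":
            # An empty value resets the mapping built so far.
            mapping.clear()
            continue
        # Split on the first colon only, so the value may contain spaces or colons.
        key, sep, target = value.partition(":")
        if not sep or not key or not target:
            raise ValueError(f"expected KEY:VALUE, got {value!r}")
        mapping[key] = target
    return mapping
```

Under these assumptions, passing the values "" and ".cgi:Perl" in that order leaves a mapping that contains only the .cgi rule.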

Tokenizing changes

The next step involves running a lexer or a parser on the changes, or on a whole changed file. Currently (as of version 0.4.1) the only supported lexer is the one provided by Pygments, a syntax highlighting library for Python.

The code that runs the lexer can be found in diffannotator.lexer module, which is created from the src/diffannotator/lexer.py source code file. This code is responsible for selecting and caching lexers, handling errors, etc.

The lexing process uses the .get_tokens_unprocessed(text) method from Lexer class because it provides, as one of values, the starting position of the token within the input text (index); it returns an iterable of (<index>, <tokentype>, <value>) tuples. This is required to be able to split multiline tokens in such way that we have correct tokenization of each changed line. Per-line tokenization is in turn needed to determine the type (kind) of the line.
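Splitting multiline tokens can be sketched as below. The function works on any iterable of (index, tokentype, value) tuples, which is the shape described above; split_tokens_by_line is a hypothetical name and the code is an illustration, not the contents of diffannotator.lexer.

```python
def split_tokens_by_line(tokens):
    """Group tokens by 0-based line number, cutting multiline tokens at newlines."""
    lines = {}
    line_no = 0
    for index, tokentype, value in tokens:
        offset = 0
        pieces = value.split("\n")
        for i, piece in enumerate(pieces):
            is_last = i == len(pieces) - 1
            # Every piece except the last one ended with a newline in the input.
            text = piece if is_last else piece + "\n"
            if text:
                lines.setdefault(line_no, []).append((index + offset, tokentype, text))
            offset += len(text)
            if not is_last:
                line_no += 1
    return lines
```

With Pygments installed, the input could come from a call such as PythonLexer().get_tokens_unprocessed(text); the starting index of each token is what keeps the positions of the split pieces correct.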

Lexer selection

The Pygments lexer is selected using the get_lexer_for_filename(_fn, code=None, **options) function from pygments.lexers (passing only the filename, as of version 0.4.1).

If no lexer is found, then diff-annotate uses Text lexer (TextLexer) as a fallback (as of version 0.4.1).

Lexers are cached under file extension (suffix).
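The selection, fallback, and caching steps can be combined in a small sketch. The LexerCache class is hypothetical; it takes the lookup function and the fallback lexer as parameters so that the caching logic does not depend on Pygments being importable.

```python
from pathlib import PurePath


class LexerCache:
    """Cache lexers per file suffix, falling back when no lexer is found."""

    def __init__(self, find_lexer, fallback_lexer):
        self.find_lexer = find_lexer          # e.g. a filename -> lexer function
        self.fallback_lexer = fallback_lexer  # used when the lookup fails
        self._by_suffix = {}

    def get(self, filename):
        suffix = PurePath(filename).suffix
        if suffix not in self._by_suffix:
            try:
                self._by_suffix[suffix] = self.find_lexer(filename)
            except ValueError:
                # pygments.util.ClassNotFound is a subclass of ValueError.
                self._by_suffix[suffix] = self.fallback_lexer
        return self._by_suffix[suffix]
```

With Pygments, the cache could be created as LexerCache(pygments.lexers.get_lexer_for_filename, pygments.lexers.TextLexer()). One consequence of keying on the suffix in this sketch is that all files without an extension share a single cache entry.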

Input for lexer

If the source of patches (unified diffs) is patches on disk, then what is passed to the lexer is the pre-image hunk or the post-image hunk, reconstructed, respectively, from context lines and deleted lines, or from context lines and added lines.

For example, given the following patch

diff --git a/tqdm/contrib/__init__.py b/tqdm/contrib/__init__.py
index 1dddacf..935ab63 100644
--- a/tqdm/contrib/__init__.py
+++ b/tqdm/contrib/__init__.py
@@ -38,7 +38,7 @@ def tenumerate(iterable, start=0, total=None, tqdm_class=tqdm_auto,
         if isinstance(iterable, np.ndarray):
             return tqdm_class(np.ndenumerate(iterable),
                               total=total or len(iterable), **tqdm_kwargs)
-    return enumerate(tqdm_class(iterable, start, **tqdm_kwargs))
+    return enumerate(tqdm_class(iterable, **tqdm_kwargs), start)


 def _tzip(iter1, *iter2plus, **tqdm_kwargs):
the pre-image hunk would be
        if isinstance(iterable, np.ndarray):
            return tqdm_class(np.ndenumerate(iterable),
                              total=total or len(iterable), **tqdm_kwargs)
    return enumerate(tqdm_class(iterable, start, **tqdm_kwargs))


def _tzip(iter1, *iter2plus, **tqdm_kwargs):
That is what would get passed to the lexer to allow for extracting the tokenization of deleted lines. In this case it would be the following line:
    return enumerate(tqdm_class(iterable, start, **tqdm_kwargs))
The same process is applied to post-image of the hunk and to added lines.
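The reconstruction described above can be sketched in a few lines. The function hunk_images is a hypothetical helper that takes the lines of a single hunk (without the @@ header) and is not PatchScope's implementation.

```python
def hunk_images(hunk_lines):
    """Rebuild pre-image and post-image text from the lines of one hunk."""
    pre, post = [], []
    for line in hunk_lines:
        marker, text = line[:1], line[1:]
        if marker == "-":
            pre.append(text)       # deleted line: pre-image only
        elif marker == "+":
            post.append(text)      # added line: post-image only
        elif marker == "\\":
            continue               # "\ No newline at end of file" marker
        else:
            pre.append(text)       # context line (or empty line): both images
            post.append(text)
    return "\n".join(pre), "\n".join(post)
```

Applied to the hunk from the tqdm example, the first returned string is the pre-image hunk shown above, and the second is the same text with the added line in place of the deleted one.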

If this change (this diff) was extracted from the tqdm project repository, instead of passing pre-image or post-image hunk of changes to the lexer, we can pass whole pre-image or post-image file contents.

In this case, the change came from the commit c0dcf39 in the tqdm repository. The pre-image file contents would be 8cc777fe:tqdm/contrib/__init__.py, and post-image contents would be c0dcf39:tqdm/contrib/__init__.py.

The diff-annotate patch ... and diff-annotate dataset ... commands pass the pre-image and post-image hunks to the lexer, while diff-annotate from-repo ... passes the whole pre-image and post-image file contents. In the latter case, you can turn off this feature (e.g. to achieve better performance) with the --no-use-repo option.

Determining line type (kind)

The next step is processing changed lines, one by one, and (among others) determining the line type (line kind).

Built-in rules

The built-in rules for determining the line type are contained in 1 variable and 3 functions.

The PURPOSE_TO_ANNOTATION global variable is (as of version 0.4.1) used to determine the type (kind) of the line in a very simple way: if the changed file pathname (before changes, or after changes, respectively) matches one of the patterns contained in PURPOSE_TO_ANNOTATION, then the whole line has this purpose as its line type. In this case we don't perform or retrieve tokenization (lexing).

Otherwise, the following rule is applied (see code in .process() method of the AnnotatedHunk class in diffannotator.annotate module):

  1. If line passes line_is_comment() test, then its type is "documentation".
  2. Otherwise, purpose_to_default_annotation() function is consulted, which returns "code" for files with purpose of "programming", or file purpose as line type otherwise.

A line is declared a comment (by line_is_comment()) if both of the following conditions are true:

  • among line tokens there is at least one token corresponding to comments
  • there are only comment tokens or whitespace tokens in that line
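The two-step rule and the comment test can be sketched as follows. The functions below are hypothetical re-creations written from the description above, not the code of line_is_comment() or purpose_to_default_annotation(). Token types are represented as tuples of names such as ("Comment", "Single"), standing in for Pygments token types, and whitespace is recognized by the token value, which is an assumption of this sketch.

```python
def is_comment_line(tokens):
    """True if the line has a comment token and otherwise only whitespace."""
    has_comment = False
    for _index, tokentype, value in tokens:
        if tuple(tokentype[:1]) == ("Comment",):
            has_comment = True
        elif value.strip():
            # A non-comment token with visible text: not a pure comment line.
            return False
    return has_comment


def default_line_type(file_purpose):
    """'code' for programming files, otherwise the file purpose itself."""
    return "code" if file_purpose == "programming" else file_purpose


def classify_line(file_data, tokens):
    """Apply the comment test first, then fall back to the file purpose."""
    if is_comment_line(tokens):
        return "documentation"
    return default_line_type(file_data["purpose"])
```

A line that mixes code and a trailing comment is therefore not treated as documentation, because it contains a non-comment, non-whitespace token.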

Configuration from command line

You can change how changed lines are processed by providing a line callback with the --line-callback option.

The --line-callback option is modeled on the callbacks in git-filter-repo. You can use it in one of two ways.

First possibility is to provide the body of the callback function on the command line, using the command line argument like

diff-annotate --line-callback 'BODY' ...
For this case, the following code will be compiled and called:
def line_callback(file_data, tokens):
    BODY
Thus, you just need to make sure your BODY returns appropriate line type (as string value). Note that the diff-annotate script checks for the existence of this return statement.

Second possibility is to provide the path name to a file with the callback function

diff-annotate --line-callback <callback_script.py>
An example of such callback function can be found in data/experiments/, in the HaPy-Bug/ subdirectory, as hapybug_line_callback_func.py:

def line_callback(file_data, tokens): # NOTE: function definition must currently be first line of the file

    # based on the code used to generate initial annotations for HaPy-Bug
    # https://github.com/ncusi/python_cve_dataset/blob/main/annotation/annotate.py#L80

    #print(f"RUNNING line_callback({file_data!r}, ...) -> {''.join([t[2] for t in tokens]).rstrip()}")
    line_type = file_data['purpose']

    if file_data['type'] != "programming":
        if file_data['purpose'] not in ["documentation", "test"]:
            line_type = "bug(fix)"
    else:
        # For programming languages
        if line_is_comment(tokens):
            line_type = "documentation"
            #print(f"  line is comment, {file_data=}, {line_type=}")
        elif file_data['purpose'] == "test":
            line_type = "test"
        else:
            line_type = "bug(fix)"

    #print(f"  returning {line_type=}")
    return line_type

Note: actually, the diff-annotate script processing the --line-callback parameter first checks if it can be interpreted as file name, and if file with given pathname does not exist, it interprets this parameter as function or function body.

For the parameter contents or the file contents to be interpreted as function definition rather than as function body, the content must start with def on its first line.

(as of PatchScope version 0.4.1)
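The way a --line-callback value could be turned into a callable can be sketched as below. This is an illustration based only on the description above: load_line_callback and its extra_globals parameter are hypothetical, and the real script may differ in details such as error handling. Note that exec() runs arbitrary code, so such a callback should only come from a trusted source.

```python
import textwrap
from pathlib import Path


def load_line_callback(spec, extra_globals=None):
    """Build line_callback(file_data, tokens) from a file name, a def, or a body."""
    source = spec
    try:
        # First check whether the parameter names an existing file.
        if Path(spec).is_file():
            source = Path(spec).read_text()
    except (OSError, ValueError):
        pass  # not usable as a pathname, so treat the parameter as code
    if not source.startswith("def "):
        # Function body only: require a return statement, then wrap the body.
        if "return" not in source:
            raise ValueError("callback body must contain a return statement")
        source = ("def line_callback(file_data, tokens):\n"
                  + textwrap.indent(source, "    "))
    # Names the callback relies on (e.g. line_is_comment) go into its globals.
    namespace = dict(extra_globals or {})
    exec(source, namespace)
    return namespace["line_callback"]
```

For example, load_line_callback("return file_data['purpose']") yields a callback that labels every line with the purpose of its file.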

Computing commit and diff metadata

TODO

Output format

TODO


  1. R. Widyasari et al.: "BugsInPy: a database of existing bugs in Python programs to enable controlled testing and debugging studies", ESEC/FSE 2020, pp. 1556–1560, https://doi.org/10.1145/3368089.3417943