# Code Quality
Please see the set of transform project conventions for details on general project conventions, transform configuration, testing, and IDE setup.
## Summary
This module captures code-specific metrics of the input data. The implementation is borrowed from the work done in the CodeParrot and StarCoder projects. In the current implementation, the module computes the following metrics and reports each metric in an individual column (a short illustrative sketch of these heuristics follows the list):
- line-specific metrics, including mean and max line length
- character-to-token ratio: uses the input tokenizer to tokenize the input data and measures the ratio between the number of characters and the number of tokens
- identifies a high occurrence of the keywords "test " or "config" and tags such samples as config or test samples
- tags the samples as autogenerated if the sample contains keywords like `auto-generated`, `autogenerated` or `automatically generated`
- programming language specific identification, where:
    - if the input sample is `python` programming language and the sample has no reference to constructs like `def` or `class`, it is highlighted as `has_no_keywords`
    - if the input sample is `xml` or `html` programming language, the sample is inspected and tagged under the `is_xml` or `is_html` metric respectively
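The sketch below shows one way such keyword heuristics can be expressed; it is illustrative only, and the keyword lists and the occurrence threshold are simplified assumptions rather than the actual logic borrowed from CodeParrot/StarCoder:

```python
# Illustrative sketch of keyword-based tagging heuristics.
# The keyword lists and the threshold are assumptions for illustration,
# not the transform's actual implementation.

AUTOGENERATED_KEYWORDS = ["auto-generated", "autogenerated", "automatically generated"]
CONFIG_OR_TEST_KEYWORDS = ["config", "test "]
PYTHON_CONSTRUCTS = ["def ", "class "]


def autogenerated(contents: str) -> bool:
    """Tag a sample as autogenerated if it mentions any of the keywords."""
    lowered = contents.lower()
    return any(keyword in lowered for keyword in AUTOGENERATED_KEYWORDS)


def config_or_test(contents: str, threshold: int = 5) -> bool:
    """Tag a sample when the 'config'/'test ' keywords occur frequently."""
    lowered = contents.lower()
    occurrences = sum(lowered.count(keyword) for keyword in CONFIG_OR_TEST_KEYWORDS)
    return occurrences >= threshold


def has_no_keywords(contents: str, language: str) -> bool:
    """For python samples, tag when no def/class constructs are present."""
    if language.lower() != "python":
        return False
    return not any(construct in contents for construct in PYTHON_CONSTRUCTS)
```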
This module adds the following fields into the output file:
- `line_mean`
- `line_max`
- `total_num_lines`
- `avg_longest_lines`
- `alphanum_frac`
- `char_token_ratio`
- `autogenerated`
- `config_or_test`
- `has_no_keywords`
- `has_few_assignments`
- `is_xml`
- `is_html`
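For illustration only, a single processed row might then carry metric values shaped like the hypothetical record below (all values are made up):

```python
# Hypothetical example of the metric columns added to one output row.
# All values are made up for illustration.
example_row = {
    "line_mean": 23.4,
    "line_max": 118,
    "total_num_lines": 87,
    "avg_longest_lines": 96.5,
    "alphanum_frac": 0.72,
    "char_token_ratio": 3.1,
    "autogenerated": False,
    "config_or_test": False,
    "has_no_keywords": False,
    "has_few_assignments": False,
    "is_xml": False,
    "is_html": False,
}
```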
The transform uses a tokenizer to compute the character-to-token ratio. If the specified tokenizer is not found in the local cache, it is downloaded from Hugging Face. By default, the `codeparrot/codeparrot` tokenizer is used.
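As a rough sketch of how that ratio can be computed with the Hugging Face `transformers` library (an assumption about the approach, not the transform's exact code):

```python
from transformers import AutoTokenizer

# Loads from the local cache, or downloads from Hugging Face on first use.
# For gated tokenizers, from_pretrained also accepts an auth token.
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")


def char_token_ratio(contents: str) -> float:
    """Ratio between the number of characters and the number of tokens."""
    token_ids = tokenizer(contents, truncation=False)["input_ids"]
    return len(contents) / len(token_ids) if token_ids else 0.0
```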
## Running
### Launcher Command Line Options
The following command line arguments are available in addition to the options provided by the Ray launcher and the Python launcher.
- `--contents_column_name` - the name of the column that contains the data to process. Default: `contents`.
- `--language_column_name` - the name of the column that contains the programming language details. Default: `language`.
- `--tokenizer` - the tokenizer used to convert the data into tokens. Default: `codeparrot/codeparrot`.
- `--hf_token` - the Hugging Face auth token used to download the tokenizer. This option is only required for tokenizers whose access is restricted on Hugging Face.
### Running the samples
To run the samples, use the following `make` targets:
- `run-cli-sample` - runs src/code_quality_transform_python.py using command line args
- `run-local-sample` - runs src/code_quality_local_python.py
These targets will activate the virtual environment and set up any configuration needed.
Use the `-n` option of `make` to see the detail of what is done to run the sample.
For example, `make -n run-cli-sample` shows the steps that would be executed for that sample. After running a sample, inspect the configured output location to see the results of the transform.

### Transforming data using the transform image
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.