# Code Quality
Please see the set of transform project conventions for details on general project conventions, transform configuration, testing, and IDE setup.
## Summary
This module captures code-specific metrics of the input data. The implementation is borrowed from the work done in the CodeParrot and StarCoder projects. In the current implementation, the module computes the following metrics and reports each metric in an individual column (a short illustrative sketch of these heuristics follows the list):
- line-specific metrics, including mean and max line length
- character-to-token ratio: uses the input tokenizer to tokenize the input data and measures the ratio between the number of characters and the number of tokens
- identifies a high occurrence of the keywords "test " or "config" and tags such samples as config or test samples
- tags the samples as autogenerated if the sample contains keywords like `auto-generated`, `autogenerated` or `automatically generated`
- programming language specific identification, where:
    - if the input sample is `python` programming language and the sample has no reference to constructs like `def` or `class`, it is highlighted as `has_no_keywords`
    - if the input sample is `xml` or `html` programming language, the sample is inspected and tagged under the `is_xml` or `is_html` metric respectively
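The sketch below shows one way such keyword heuristics can be expressed; it is illustrative only, and the keyword lists and the occurrence threshold are simplified assumptions rather than the actual logic borrowed from CodeParrot/StarCoder:

```python
# Illustrative sketch of keyword-based tagging heuristics.
# The keyword lists and the threshold are assumptions for illustration,
# not the transform's actual implementation.

AUTOGENERATED_KEYWORDS = ["auto-generated", "autogenerated", "automatically generated"]
CONFIG_OR_TEST_KEYWORDS = ["config", "test "]
PYTHON_CONSTRUCTS = ["def ", "class "]


def autogenerated(contents: str) -> bool:
    """Tag a sample as autogenerated if it mentions any of the keywords."""
    lowered = contents.lower()
    return any(keyword in lowered for keyword in AUTOGENERATED_KEYWORDS)


def config_or_test(contents: str, threshold: int = 5) -> bool:
    """Tag a sample when the 'config'/'test ' keywords occur frequently."""
    lowered = contents.lower()
    occurrences = sum(lowered.count(keyword) for keyword in CONFIG_OR_TEST_KEYWORDS)
    return occurrences >= threshold


def has_no_keywords(contents: str, language: str) -> bool:
    """For python samples, tag when no def/class constructs are present."""
    if language.lower() != "python":
        return False
    return not any(construct in contents for construct in PYTHON_CONSTRUCTS)
```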
This module adds the following fields into the output file:
- `line_mean`
- `line_max`
- `total_num_lines`
- `avg_longest_lines`
- `alphanum_frac`
- `char_token_ratio`
- `autogenerated`
- `config_or_test`
- `has_no_keywords`
- `has_few_assignments`
- `is_xml`
- `is_html`
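For illustration only, a single processed row might then carry metric values shaped like the hypothetical record below (all values are made up):

```python
# Hypothetical example of the metric columns added to one output row.
# All values are made up for illustration.
example_row = {
    "line_mean": 23.4,
    "line_max": 118,
    "total_num_lines": 87,
    "avg_longest_lines": 96.5,
    "alphanum_frac": 0.72,
    "char_token_ratio": 3.1,
    "autogenerated": False,
    "config_or_test": False,
    "has_no_keywords": False,
    "has_few_assignments": False,
    "is_xml": False,
    "is_html": False,
}
```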
The transform uses a tokenizer to compute the character-to-token ratio. If the specified tokenizer is not found in the local cache, it is downloaded from Hugging Face. By default, the `codeparrot/codeparrot` tokenizer is used.
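As a rough sketch of how that ratio can be computed with the Hugging Face `transformers` library (an assumption about the approach, not the transform's exact code):

```python
from transformers import AutoTokenizer

# Loads from the local cache, or downloads from Hugging Face on first use.
# For gated tokenizers, from_pretrained also accepts an auth token.
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")


def char_token_ratio(contents: str) -> float:
    """Ratio between the number of characters and the number of tokens."""
    token_ids = tokenizer(contents, truncation=False)["input_ids"]
    return len(contents) / len(token_ids) if token_ids else 0.0
```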
## Running
### Launcher Command Line Options
The following command line arguments are available in addition to the options provided by the Ray launcher and the Python launcher.
- `--contents_column_name` - the name of the column that contains the data to process. Default: `contents`.
- `--language_column_name` - the name of the column that contains the programming language details. Default: `language`.
- `--tokenizer` - the tokenizer used to convert the data into tokens. Default: `codeparrot/codeparrot`.
- `--hf_token` - the Hugging Face auth token used to download the tokenizer. This option is only required for tokenizers whose access is restricted on Hugging Face.
### Running the samples
To run the samples, use the following `make` targets:
- `run-cli-sample` - runs src/code_quality_transform_python.py using command line args
- `run-local-sample` - runs src/code_quality_local_python.py
These targets will activate the virtual environment and set up any configuration needed.
Use the `-n` option of `make` to see the detail of what is done to run the sample.
For example, `make -n run-cli-sample` shows the steps that would be executed for that sample. After running a sample, inspect the configured output location to see the results of the transform.

### Transforming data using the transform image
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.