Vulnerability or defect detection is a long-standing problem in software engineering, and many new ML models have recently been proposed to solve it. To help research in this area, we released the D2A dataset, which is built from Infer static analyzer bug reports on real-world C programs. D2A goes one step further and performs a differential analysis: it compares the before version (the potential vulnerability) and the after version (the fix of the vulnerability) of each program, and labels a bug report in the before version as a real bug if the report no longer appears in the after version. Through this leaderboard we explore multiple ways to solve the vulnerability detection problem by identifying the real vulnerabilities among the many candidates generated by static analysis.

From the D2A dataset we have extracted three types of data:
1. Infer Bug Reports (Trace)
2. Bug function source code (Function)
3. Bug function source code, trace function source code, and bug function file URL (Code).

From these three types of data we have created four real-bug prediction tasks:
1. Code + Trace
2. Trace
3. Code
4. Function.

Since the Function dataset is balanced, we use accuracy to evaluate results on it. For the other, unbalanced datasets we use F1 and AUROC. More details on the data and the metrics can be found here.
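As a concrete illustration, the metrics above can be computed from raw probability scores as in the minimal from-scratch sketch below. This is for clarity only; the official evaluation script (linked above) should be used for actual submissions.

```python
def f1_score(labels, preds, threshold=0.5):
    """F1 for binary labels given probability scores (label 1 = real bug)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auroc(labels, preds):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formula."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    ranks = [0.0] * len(preds)
    i = 0
    while i < len(order):
        # Assign tied scores their average rank (1-based).
        j = i
        while j + 1 < len(order) and preds[order[j + 1]] == preds[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for y, r in zip(labels, ranks) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

For example, with labels `[1, 0, 0, 1]` and scores `[0.9, 0.8, 0.2, 0.4]`, `auroc` returns 0.75 and `f1_score` (at the default 0.5 threshold) returns 0.5.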
To correctly identify a defect or vulnerability, we believe a model must be able to extract features from different types of data. Since most vulnerabilities span multiple functions, it is also important for the model to perform inter-procedural feature extraction. The Trace data is a combination of natural language and source code, while the Code data contains the source code of the functions that appear in the Trace. This leaderboard should demonstrate how well models detect real vulnerabilities on each of these tasks.


Leaderboard

| Rank | Model | Team | Organization | Date | Overall Score (Average) | Code + Trace F1 | Code + Trace AUC | Trace F1 | Trace AUC | Code F1 | Code AUC | Function Accuracy |
|------|-------|------|--------------|------|-------------------------|-----------------|------------------|----------|-----------|---------|---------|-------------------|
| 1 | AugSA-S | AI4VA | IBM Research | 03/26/2021 | 70.8 | 63.4 | 83.6 | 61.1 | 81.2 | 65.8 | 85.2 | 55.2 |
| 2 | C-BERT | AI4VA | IBM Research | 03/26/2021 | 70.5 | 66.1 | 81.7 | 62.4 | 80.4 | 62.4 | 80.2 | 60.2 |
| 3 | AugSA-V | AI4VA | IBM Research | 03/26/2021 | 69.6 | 64.3 | 85.0 | 61.3 | 80.2 | 65.2 | 85.7 | 45.6 |


Submission Instructions

The expected output is a probability score for each example of the test/dev set: the probability that the example has label 1. Once your model is fully trained, you can check its performance on the dev set using the evaluation script here. To get the performance on the test set, follow these steps:

  1. Generate your prediction output for the dev set.
  2. Evaluate the dev set predictions according to the evaluation script from the link above.
  3. Generate your prediction output for the test set.
  4. Submit the following information by email to saurabh.pujar@ibm.com.

Your email must include:

  1. Prediction results on the test set.
  2. Individual/team name: the name of the individual or team, to appear in the leaderboard.
  3. Model information: the name of the model/technique, to appear in the leaderboard.

We recommend your email also include:

  1. Prediction results on the dev set.
  2. Individual/team institution: the name of the individual's or team's institution, to appear in the leaderboard.
  3. Model code: the training code for the model.
  4. Publication information: the name, citation, and URL of the paper, to appear in the leaderboard, if the model is from published work.


How to cite

Please cite the D2A paper.