Vulnerability or defect detection is a major problem in software engineering, and many new ML models have recently been proposed to solve it. To help research in this area, we recently released the D2A dataset, which is built from Infer Static Analyzer bug reports on real-world C programs. D2A goes one step further and performs a differential analysis: it compares the before-fix version (potential vulnerability) with the after-fix version (fix of the vulnerability) and labels an issue in the before-fix version as a likely real bug when that issue no longer appears after the fix. Through this leaderboard we explore multiple ways to solve the vulnerability detection problem by identifying the real vulnerabilities among the many candidates generated by static analysis.
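As a rough sketch of the differential-labeling idea (the function and field names below are hypothetical, not the actual D2A pipeline), an issue reported by Infer in the before-fix version is labeled a likely real bug when it no longer appears in the after-fix version:

```python
# Sketch of the differential-labeling idea behind D2A (illustrative only;
# identifiers and matching logic here are hypothetical, not the D2A implementation).

def label_before_issues(before_issues, after_issues):
    """Assign a label to each Infer issue found in the before-fix version.

    Label 1 (likely real bug): the issue disappears in the after-fix version.
    Label 0 (likely false positive): the issue still appears after the fix.
    """
    return {issue: int(issue not in after_issues) for issue in before_issues}


# Example: two issues reported before the fix, one of which survives the fix.
before = {"BUFFER_OVERRUN_L1@parse.c:42", "NULL_DEREFERENCE@util.c:10"}
after = {"NULL_DEREFERENCE@util.c:10"}
print(label_before_issues(before, after))
# -> {'BUFFER_OVERRUN_L1@parse.c:42': 1, 'NULL_DEREFERENCE@util.c:10': 0}
```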
From the D2A dataset we have extracted three types of data:
1. Infer bug reports (Trace)
2. Bug function source code (Function)
3. Bug function source code, trace function source code, and bug function file URL (Code)
From these three types of data we have created four real-bug prediction tasks (an illustrative sketch of the per-task inputs follows this list):
1. Code + Trace
2. Trace
3. Code
4. Function
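For illustration only, the example below sketches the kind of record each task sees and which fields each task uses; the field names are hypothetical and do not reflect the exact D2A schema.

```python
# Illustrative record and per-task inputs (field names are hypothetical;
# see the D2A documentation for the actual schema).
example = {
    "label": 1,                      # 1 = real bug, 0 = false positive
    "trace": "BUFFER_OVERRUN_L1 ... Infer bug report text describing the trace ...",
    "bug_function": "static int parse_header(buf_t *b) { ... }",
    "trace_functions": ["int read_chunk(buf_t *b, size_t n) { ... }"],
    "bug_file_url": "<URL of the file containing the bug function>",
}

# Fields consumed by each of the four prediction tasks (illustrative mapping).
task_inputs = {
    "Code + Trace": ["trace", "bug_function", "trace_functions", "bug_file_url"],
    "Trace":        ["trace"],
    "Code":         ["bug_function", "trace_functions", "bug_file_url"],
    "Function":     ["bug_function"],
}
```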
Since the Function dataset is balanced, we use accuracy to evaluate its results. For the other, unbalanced datasets we use F1 and AUROC. More details on the data and the metrics can be found here.
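A minimal sketch of computing these metrics from predicted probabilities with scikit-learn (the 0.5 decision threshold is an assumption for illustration, not necessarily what the official evaluation script uses):

```python
# Minimal metric sketch using scikit-learn; the 0.5 threshold is an assumption.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0]              # gold labels (1 = real bug)
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6]  # predicted probability of label 1
y_pred = [int(p >= 0.5) for p in y_prob] # hard labels for accuracy / F1

print("Accuracy:", accuracy_score(y_true, y_pred))  # balanced Function task
print("F1:", f1_score(y_true, y_pred))              # unbalanced tasks
print("AUROC:", roc_auc_score(y_true, y_prob))      # unbalanced tasks, threshold-free
```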
To correctly identify a defect or vulnerability, we believe a model must be able to extract features from different types of data. And since most vulnerabilities span multiple functions, it is also important for the model to perform inter-procedural feature extraction. The Trace data combines natural language and source code, while the Code data contains the source code of the multiple functions that appear in the trace. This leaderboard should demonstrate how well models detect the real vulnerabilities on each of these tasks.
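As one illustration (not the approach used by the models listed below), the trace text and the source of the functions it mentions could be concatenated into a single sequence before tokenization for the Code + Trace task; the separator token and ordering are assumptions:

```python
# Illustrative preprocessing for the Code + Trace task: join the trace text with
# the bug function and the other functions it mentions. The separator token and
# ordering are assumptions, not the preprocessing used by the listed baselines.
def build_code_trace_input(trace_text, bug_function, trace_functions, sep="</s>"):
    parts = [trace_text, bug_function, *trace_functions]
    return f" {sep} ".join(parts)


model_input = build_code_trace_input(
    "BUFFER_OVERRUN_L1: offset may exceed the buffer size ...",
    "static int parse_header(buf_t *b) { ... }",
    ["int read_chunk(buf_t *b, size_t n) { ... }"],
)
print(model_input)
```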
Overall leaderboard across all four tasks (the Overall Score is the average of the seven per-task metrics):

| Rank | Model | Team | Organization | Date | Code + Trace F1 | Code + Trace AUC | Trace F1 | Trace AUC | Code F1 | Code AUC | Function Accuracy | Overall Score (Average) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | AugSA-S | AI4VA | IBM Research | 03/26/2021 | 63.4 | 83.6 | 61.1 | 81.2 | 65.8 | 85.2 | 55.2 | 70.8 |
| 2 | C-BERT | AI4VA | IBM Research | 03/26/2021 | 66.1 | 81.7 | 62.4 | 80.4 | 62.4 | 80.2 | 60.2 | 70.5 |
| 3 | AugSA-V | AI4VA | IBM Research | 03/26/2021 | 64.3 | 85.0 | 61.3 | 80.2 | 65.2 | 85.7 | 45.6 | 69.6 |
| - | - | ML4CSec | Imperial College London | 11/10/2021 | - | - | - | - | - | - | 62.3 | - |
| - | - | ML4CSec | Imperial College London | 11/10/2021 | - | - | - | - | - | - | 60.7 | - |
Code + Trace task:

| Rank | Model | Team | Organization | Date | F1 | AUC | Average |
|---|---|---|---|---|---|---|---|
| 1 | AugSA-V | AI4VA | IBM Research | 03/26/2021 | 64.3 | 85.0 | 74.6 |
| 2 | C-BERT | AI4VA | IBM Research | 03/26/2021 | 66.1 | 81.7 | 73.9 |
| 3 | AugSA-S | AI4VA | IBM Research | 03/26/2021 | 63.4 | 83.6 | 73.5 |
Trace task:

| Rank | Model | Team | Organization | Date | F1 | AUC | Average |
|---|---|---|---|---|---|---|---|
| 1 | C-BERT | AI4VA | IBM Research | 03/26/2021 | 62.4 | 80.4 | 71.4 |
| 2 | AugSA-S | AI4VA | IBM Research | 03/26/2021 | 61.1 | 81.2 | 71.1 |
| 3 | AugSA-V | AI4VA | IBM Research | 03/26/2021 | 61.3 | 80.2 | 70.7 |
Code task:

| Rank | Model | Team | Organization | Date | F1 | AUC | Average |
|---|---|---|---|---|---|---|---|
| 1 | AugSA-S | AI4VA | IBM Research | 03/26/2021 | 65.8 | 85.2 | 75.5 |
| 2 | AugSA-V | AI4VA | IBM Research | 03/26/2021 | 65.2 | 85.7 | 75.4 |
| 3 | C-BERT | AI4VA | IBM Research | 03/26/2021 | 62.4 | 80.2 | 71.3 |
Function task:

| Rank | Model | Team | Organization | Date | Accuracy |
|---|---|---|---|---|---|
| 1 | - | ML4CSec | Imperial College London | 11/10/2021 | 62.3 |
| 2 | - | ML4CSec | Imperial College London | 11/10/2021 | 60.7 |
| 3 | C-BERT | AI4VA | IBM Research | 03/26/2021 | 60.2 |
| 4 | AugSA-S | AI4VA | IBM Research | 03/26/2021 | 55.2 |
| 5 | AugSA-V | AI4VA | IBM Research | 03/26/2021 | 45.6 |
The expected output is a probability score for each example in the test/dev set; the score is the probability that the example has label 1 (a minimal sketch of this output format appears at the end of this section). Once your model is fully trained, you can check its performance on the dev set using the evaluation script here. To get its performance on the test set, follow the steps below:
Your email must include:
We recommend that your email also include:
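For reference, here is a minimal sketch of producing the per-example probability scores described above, one score per line (the file name and line format are assumptions; follow the released evaluation script for the exact expected format):

```python
# Illustrative only: write one P(label = 1) per line for the dev/test examples.
# The file name and formatting are assumptions, not the official submission format.
probabilities = [0.91, 0.12, 0.73]  # hypothetical model outputs, one per example

with open("dev_predictions.txt", "w") as f:
    for p in probabilities:
        f.write(f"{p:.6f}\n")
```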