Dataset Generator
HestiaDatasetGenerator(data)
Class for generating multiple Dataset partitions for generalisation evaluation.
Initialise the class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | DataFrame | DataFrame with the original data from which datasets will be generated. | required |
calculate_augood(results, target_df, target_field_name, target_embds=None)
Calculate the 'area under the GOOD curve' (AU-GOOD) metric.
This function calculates an AU-GOOD score by computing a weighted metric from the similarity values obtained by comparing the target deployment distribution with the training distribution. It returns both the weighted GOOD curve values and the AU-GOOD score.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
results | Dict[float, float] | A dictionary where keys are bins or thresholds (float) and values are metrics or counts associated with each bin. | required |
target_df | DataFrame | A DataFrame containing the target data for similarity comparison. The column specified by | required |
target_field_name | Optional[str] | Name of the field in | required |
target_embds | Optional[ndarray] | A NumPy array containing the target embeddings for similarity calculation. | None |
Returns:
Type | Description |
---|---|
Tuple[np.ndarray, float] | A tuple containing the weighted GOOD curve values and the AU-GOOD score. |
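The weighting idea described above can be illustrated with a small standalone sketch. It is not Hestia's implementation: the function name `weighted_good_curve`, the `target_similarities` input (one similarity value per target item, for example its highest similarity to the training data) and the binning rule are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


def weighted_good_curve(
    results: Dict[float, float],
    target_similarities: Sequence[float],
) -> Tuple[List[float], float]:
    """Weight each threshold's metric by the share of target items in its bin.

    Hypothetical helper, not part of Hestia.
    """
    if not results or not target_similarities:
        raise ValueError("results and target_similarities must be non-empty")

    thresholds = sorted(results)
    counts = [0] * len(thresholds)
    for sim in target_similarities:
        # Assumed binning rule: use the largest threshold not above `sim`,
        # falling back to the lowest threshold for very dissimilar items.
        idx = max(bisect_right(thresholds, sim) - 1, 0)
        counts[idx] += 1

    total = len(target_similarities)
    # Each curve value is the metric at a threshold times the fraction of
    # target items falling into that threshold's bin.
    curve = [results[t] * c / total for t, c in zip(thresholds, counts)]
    return curve, sum(curve)


curve, score = weighted_good_curve(
    {0.0: 0.5, 0.5: 1.0},       # metric measured at each threshold
    [0.1, 0.6, 0.7, 0.9],       # similarity of each target item to training data
)
print(curve, score)
```

In this sketch, a model that performs well only at thresholds where few target items fall gets little credit, which is the intuition behind weighting the curve by the deployment distribution.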
calculate_partitions(sim_args=None, sim_df=None, label_name=None, min_threshold=0.0, threshold_step=0.05, test_size=0.2, valid_size=0.1, partition_algorithm='ccpart', random_state=42, n_partitions=None)
Calculates multiple partitions of a dataset for training, validation, and testing based on sequence similarity.
Supports two partitioning algorithms: `ccpart` and `graph_part`. Additionally, it computes partitions for different similarity thresholds and random partitions.
Example:

```python
# Partition starting from a minimum similarity threshold of 0.2, with a test size of 0.2.
# `generator` is assumed to be a HestiaDatasetGenerator instance and
# `similarity_args` a SimilarityArguments instance.
partitions = generator.calculate_partitions(
    sim_args=similarity_args,
    label_name='Y',
    min_threshold=0.2,
    threshold_step=0.05,
    test_size=0.2,
    partition_algorithm='ccpart',
    random_state=42,
)

# Access the partitions for a specific threshold (here 0.3).
train_set = partitions[0.3]['train']
valid_set = partitions[0.3]['valid']
test_set = partitions[0.3]['test']
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sim_args | Optional[SimilarityArguments] | Object containing the similarity parameters for partitioning. This includes options for calculating sequence similarity, such as the alignment method and similarity threshold. Defaults to None. | None |
sim_df | Optional[DataFrame] | Precomputed similarity DataFrame. If None, the similarity will be calculated using | None |
label_name | Optional[str] | The name of the label column for the dataset. Defaults to None. | None |
min_threshold | Optional[float] | The minimum similarity threshold to start partitioning. Defaults to 0.0. | 0.0 |
threshold_step | Optional[float] | The step size for varying the similarity threshold during partitioning. Defaults to 0.05. | 0.05 |
test_size | Optional[float] | The proportion of the dataset to allocate to the test set. Defaults to 0.2. | 0.2 |
valid_size | Optional[float] | The proportion of the training set to allocate to the validation set. Defaults to 0.1. | 0.1 |
partition_algorithm | Optional[str] | The partitioning algorithm to use. Options are: `ccpart`, `graph_part`. | 'ccpart' |
random_state | Optional[int] | The random seed for reproducibility. Defaults to 42. | 42 |
n_partitions | Optional[int] | The number of partitions to create when using | None |
Returns:
Type | Description |
---|---|
dict | A dictionary containing the partitions for each threshold. The dictionary has keys: - |
Raises:
Type | Description |
---|---|
ValueError | If an unsupported partition algorithm is specified. |
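To make the idea of similarity-aware partitioning concrete, here is a minimal standalone sketch of one common approach: link items whose similarity exceeds the threshold, then keep every connected group on a single side of the split. This page does not state how `ccpart` or `graph_part` work internally, so the sketch should not be read as either algorithm; the function names and the `similarities` mapping of index pairs to scores are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def connected_components(
    n_items: int, similar_pairs: Iterable[Tuple[int, int]]
) -> List[List[int]]:
    """Group item indexes that are linked, directly or indirectly (union-find)."""
    parent = list(range(n_items))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[int, List[int]] = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def split_by_components(
    n_items: int,
    similarities: Dict[Tuple[int, int], float],
    threshold: float,
    test_size: float,
) -> Tuple[List[int], List[int]]:
    """Fill the test set with the smallest groups; everything else trains."""
    pairs = [pair for pair, score in similarities.items() if score > threshold]
    components = sorted(connected_components(n_items, pairs), key=len)
    target = round(test_size * n_items)

    train: List[int] = []
    test: List[int] = []
    for component in components:
        # A whole group goes to one side, so no pair above the threshold
        # is ever split between train and test.
        if len(test) + len(component) <= target:
            test.extend(component)
        else:
            train.extend(component)
    return sorted(train), sorted(test)


train_idx, test_idx = split_by_components(
    n_items=5,
    similarities={(0, 1): 0.9, (1, 2): 0.8, (3, 4): 0.1},
    threshold=0.3,
    test_size=0.4,
)
print(train_idx, test_idx)
```

Repeating such a split for each threshold between `min_threshold` and the maximum, in steps of `threshold_step`, would yield one partition per threshold, which matches the shape of the dictionary described above.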
calculate_similarity(sim_args)
Calculate pairwise similarity between all the elements in the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sim_args | SimilarityArguments | See similarity arguments entry. | required |
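As a rough illustration of what an all-against-all similarity calculation produces, the sketch below scores every unordered pair of sequences and keeps those at or above a minimum threshold. It uses Python's `difflib.SequenceMatcher` as a stand-in similarity function, which is not one of Hestia's alignment or fingerprint methods; the function name, the `(query, target, score)` row layout and the filtering behaviour of `min_threshold` are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Sequence, Tuple


def pairwise_similarity(
    items: Sequence[str], min_threshold: float = 0.0
) -> List[Tuple[int, int, float]]:
    """Return (query_index, target_index, score) for every pair kept.

    Hypothetical helper; the score is difflib's ratio in [0, 1].
    """
    rows = []
    for (i, a), (j, b) in combinations(enumerate(items), 2):
        score = SequenceMatcher(None, a, b).ratio()
        # Assumed behaviour: pairs below the minimum threshold are dropped.
        if score >= min_threshold:
            rows.append((i, j, score))
    return rows


sequences = ["AAAA", "AAAA", "CCCC"]
print(pairwise_similarity(sequences, min_threshold=0.5))
```

The number of pairs grows quadratically with the dataset size, which is why options such as `threads` and `prefilter` in `SimilarityArguments` matter for larger datasets.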
compare_models(model_results, statistical_test='wilcoxon')
staticmethod
Compare the generalisation capabilities of n models against each other, providing a p-value for every possible pair of models that measures how likely model A is to perform better than model B.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_results | Dict[str, Union[List[float], ndarray]] | Dictionary with model name as key and a list with the ordered performance values of the model at different thresholds. | required |
statistical_test | str | Statistical test to compute the model differences. Currently supported: - | 'wilcoxon' |
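The pairwise comparison can be sketched without any dependencies using an exact one-sided Wilcoxon signed-rank test on the paired per-threshold scores. This is an illustration under assumptions, not Hestia's code: the helper names are hypothetical, the choice of a one-sided alternative ("A greater than B") is a guess, and the exact enumeration is only practical for a small number of thresholds.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values in ascending order (1-based); ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def wilcoxon_greater(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact one-sided p-value for 'paired values in a exceed those in b'."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences dropped
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely, so the
    # p-value is the share of patterns with a rank sum at least as large.
    patterns = list(product((0, 1), repeat=len(ranks)))
    hits = sum(
        1 for signs in patterns
        if sum(r for r, s in zip(ranks, signs) if s) >= observed
    )
    return hits / len(patterns)


def compare_pairs(
    model_results: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """p-value for every ordered pair (A, B): is A better than B?"""
    names = list(model_results)
    return {
        (a, b): wilcoxon_greater(model_results[a], model_results[b])
        for a in names for b in names if a != b
    }


p_values = compare_pairs({
    "model_a": [0.9, 0.8, 0.7, 0.6],  # performance at increasing thresholds
    "model_b": [0.8, 0.6, 0.4, 0.2],
})
print(p_values)
```

With only four thresholds the smallest achievable one-sided p-value is 1/16, so evaluating more thresholds (a smaller `threshold_step`) gives the test more power to separate models.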
from_precalculated(data_path)
Load partition indexes if they have already been calculated.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_path | str | Path to saved partition indexes. | required |
SimilarityArguments(data_type='protein', field_name='sequence', min_threshold=0.0, threads=cpu_count(), verbose=0, save_alignment=False, filename='alignment', sim_function=None, bits=None, radius=None, fingerprint=None, denominator=None, representation=None, prefilter=None, alignment_algorithm=None, query_embds=None, target_embds=None, target_df=None, needle_config=None)
Dataclass with the inputs for similarity calculation.