Dataset Generator¶

`HestiaDatasetGenerator(data)` ¶

Class for generating multiple Dataset partitions for generalisation evaluation.

Initialise class

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	DataFrame with the original data from which datasets will be generated.	required

`calculate_augood(results, target_df, target_field_name, target_embds=None)` ¶

Calculate the 'area under the GOOD curve' (AU-GOOD) metric.

This function calculates the AU-GOOD score as a weighted metric. The weights are derived from similarity values obtained by comparing the target deployment distribution with the training distribution. It returns both the weighted GOOD curve values and the AU-GOOD score.

Parameters:

Name	Type	Description	Default
`results`	`Dict[float, float]`	A dictionary where keys are bins or thresholds (float) and values are metrics or counts associated with each bin.	required
`target_df`	`DataFrame`	A DataFrame containing the target data for similarity comparison. The column specified by `target_field_name` will be used to populate the similarity arguments for comparison.	required
`target_field_name`	`Optional[str]`	Name of the field in `target_df` that contains target values for comparison.	required
`target_embds`	`Optional[ndarray]`	A NumPy array containing the target embeddings for similarity calculation.	`None`

Returns:

Type	Description
`Tuple[np.ndarray, float]`	A tuple containing: - `good_curve` (np.ndarray): Array of weighted values representing the GOOD curve. - `au_good` (float): The calculated area under the GOOD curve.
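
The weighting idea described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Hestia's implementation: it assumes each target item has already been reduced to its maximum similarity against the training set, that each threshold in `results` opens a bin running up to the next threshold, and that the weight of a bin is the fraction of target items falling into it. The function name `weighted_good_curve` and the input `target_max_sims` are hypothetical.

```python
from bisect import bisect_right


def weighted_good_curve(results, target_max_sims):
    """Weight each per-threshold metric by the share of target items in its bin.

    Assumption: bin i covers [thresholds[i], thresholds[i + 1]); values below
    the first threshold are clamped into the first bin.
    """
    thresholds = sorted(results)
    counts = [0] * len(thresholds)
    for sim in target_max_sims:
        # Index of the largest threshold that is <= sim.
        idx = max(bisect_right(thresholds, sim) - 1, 0)
        counts[idx] += 1

    total = len(target_max_sims)
    weights = [count / total for count in counts]
    good_curve = [results[t] * w for t, w in zip(thresholds, weights)]
    return good_curve, sum(good_curve)


# Hypothetical inputs: model performance per threshold and target similarities.
results = {0.0: 0.5, 0.5: 1.0}
target_max_sims = [0.1, 0.2, 0.6, 0.9]
good_curve, au_good = weighted_good_curve(results, target_max_sims)
```

Half of the hypothetical target items fall into each bin here, so each metric contributes with weight 0.5.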

`calculate_partitions(sim_args=None, sim_df=None, label_name=None, min_threshold=0.0, threshold_step=0.05, test_size=0.2, valid_size=0.1, partition_algorithm='ccpart', random_state=42, n_partitions=None)` ¶

Calculates multiple partitions of a dataset for training, validation, and testing based on sequence similarity. Supports two partitioning algorithms: ccpart and graph_part. Additionally, it computes partitions for different similarity thresholds and random partitions.

Example:

    # Example of partitioning with a similarity threshold of 0.3 and a test size of 0.2
    partitions = calculate_partitions(
        sim_args=similarity_args,
        label_name='Y',
        min_threshold=0.2,
        threshold_step=0.05,
        test_size=0.2,
        partition_algorithm='ccpart',
        random_state=42
    )

    # Accessing the partitions for a specific threshold
    train_set = partitions[0.3]['train']
    valid_set = partitions[0.3]['valid']
    test_set = partitions[0.3]['test']

Parameters:

Name	Type	Description	Default
`sim_args`	`Optional[SimilarityArguments]`	Object containing the similarity parameters for partitioning. This includes options for calculating sequence similarity, such as the alignment method and similarity threshold. Defaults to None.	`None`
`sim_df`	`Optional[DataFrame]`	Precomputed similarity DataFrame. If None, the similarity will be calculated using `sim_args`.	`None`
`label_name`	`Optional[str]`	The name of the label column for the dataset. Defaults to None.	`None`
`min_threshold`	`Optional[float]`	The minimum similarity threshold to start partitioning. Defaults to 0.0.	`0.0`
`threshold_step`	`Optional[float]`	The step size for varying the similarity threshold during partitioning. Defaults to 0.05.	`0.05`
`test_size`	`Optional[float]`	The proportion of the dataset to allocate to the test set. Defaults to 0.2.	`0.2`
`valid_size`	`Optional[float]`	The proportion of the training set to allocate to the validation set. Defaults to 0.1.	`0.1`
`partition_algorithm`	`Optional[str]`	The partitioning algorithm to use. Options are: - `'ccpart'`: Community detection partitioning algorithm. - `'graph_part'`: Graph-based partitioning. Defaults to `'ccpart'`.	`'ccpart'`
`random_state`	`Optional[int]`	The random seed for reproducibility. Defaults to 42.	`42`
`n_partitions`	`Optional[int]`	The number of partitions to create when using `graph_part`. Defaults to None.	`None`

Returns:

Type	Description
`dict`	A dictionary keyed by similarity threshold. Each entry is itself a dictionary with the keys: - `train`: DataFrame for the training set. - `valid`: DataFrame for the validation set. - `test`: DataFrame for the test set. - `clusters`: The clusters formed by the partitioning algorithm. The additional key `'random'` holds the train, valid, and test sets of the random partition.

Raises:

Type	Description
`ValueError`	If an unsupported partition algorithm is specified.
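
To make the idea of similarity-aware partitioning concrete, here is a minimal sketch. It is not Hestia's `ccpart` or `graph_part` algorithm. It assumes that items whose similarity exceeds the threshold are linked, that linked items form clusters (connected components), and that whole clusters are assigned to the test set until the requested size is reached. The helper names `connected_components` and `similarity_split` and the `(i, j, similarity)` input format are hypothetical.

```python
def connected_components(n_items, edges):
    """Group item indexes into clusters using a small union-find."""
    parent = list(range(n_items))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def similarity_split(n_items, pairs, threshold, test_size=0.2):
    """Assign whole clusters to train or test so no similar pair is separated."""
    edges = [(a, b) for a, b, sim in pairs if sim > threshold]
    # Smallest clusters first, so the test set fills up without overshooting.
    clusters = sorted(connected_components(n_items, edges), key=len)
    budget = round(test_size * n_items)

    train, test = [], []
    for cluster in clusters:
        if len(test) + len(cluster) <= budget:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)


# Hypothetical pairwise similarities between five items.
pairs = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.1)]
train, test = similarity_split(5, pairs, threshold=0.3, test_size=0.4)
```

Repeating such a split over a grid of thresholds yields one partition per threshold, which is the shape of the dictionary described in the return value above.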

`calculate_similarity(sim_args)` ¶

Calculate pairwise similarity between all the elements in the dataset.

Parameters:

Name	Type	Description	Default
`sim_args`	`SimilarityArguments`	See similarity arguments entry.	required
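
As an illustration of what an all-against-all similarity calculation produces, the sketch below compares every pair of sequences once and keeps the pairs at or above a minimum threshold. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity function, which is an assumption made for brevity and is not one of Hestia's alignment or fingerprint methods. The function name `pairwise_similarity` and the `(query, target, similarity)` row layout are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(sequences, min_threshold=0.0):
    """Return (query_index, target_index, similarity) for each unordered pair."""
    rows = []
    for (i, a), (j, b) in combinations(enumerate(sequences), 2):
        # ratio() is 2 * matches / total length, a value in [0, 1].
        sim = SequenceMatcher(None, a, b).ratio()
        if sim >= min_threshold:
            rows.append((i, j, sim))
    return rows


# Hypothetical toy sequences: two identical, one unrelated.
sequences = ["ACDE", "ACDE", "WWWW"]
similar_pairs = pairwise_similarity(sequences, min_threshold=0.5)
```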

`compare_models(model_results, statistical_test='wilcoxon')` `staticmethod` ¶

Compare the generalisation capabilities of n models against each other, providing a p-value for every possible pair of models that measures how likely model A is to perform better than model B.

Parameters:

Name	Type	Description	Default
`model_results`	`Dict[str, Union[List[float], ndarray]]`	Dictionary with model name as key and a list with the ordered performance values of the model at different thresholds.	required
`statistical_test`	`str`	Statistical test used to compute the model differences. Currently supported: - `wilcoxon`: Wilcoxon ranked-sum test.	`'wilcoxon'`
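
The shape of a pairwise comparison can be sketched as follows. To stay within the standard library, this sketch swaps in a one-sided sign test in place of the Wilcoxon test named above, so its p-values will differ from those of `compare_models`. It assumes that the performance values of all models are ordered by the same thresholds. The function names `sign_test_greater` and `compare_all` are hypothetical.

```python
from itertools import permutations
from math import comb


def sign_test_greater(a, b):
    """One-sided sign test p-value for 'a tends to be larger than b'."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses  # ties are ignored
    if n == 0:
        return 1.0
    # P(X >= wins) for X ~ Binomial(n, 0.5).
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def compare_all(model_results):
    """Return a p-value for every ordered pair (model A, model B)."""
    return {
        (a, b): sign_test_greater(model_results[a], model_results[b])
        for a, b in permutations(model_results, 2)
    }


# Hypothetical performance values at four thresholds.
model_results = {
    "model_a": [0.9, 0.8, 0.7, 0.6],
    "model_b": [0.8, 0.7, 0.6, 0.5],
}
p_values = compare_all(model_results)
```

Using ordered pairs keeps the two directions separate: a small p-value for `("model_a", "model_b")` supports model A being better, while the reverse pair tests the opposite claim.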

`from_precalculated(data_path)` ¶

Load partition indexes if they have already been calculated.

Parameters:

Name	Type	Description	Default
`data_path`	`str`	Path to saved partition indexes.	required

`load_similarity(output_path)` ¶

Load similarity calculation from file.

Parameters:

Name	Type	Description	Default
`output_path`	`str`	File with similarity calculations.	required

`save_precalculated(output_path, include_metada=True)` ¶

Save partition indexes to disk for quicker re-running.

Parameters:

Name	Type	Description	Default
`output_path`	`str`	Path where partition indexes should be saved.	required

`SimilarityArguments(data_type='protein', field_name='sequence', min_threshold=0.0, threads=cpu_count(), verbose=0, save_alignment=False, filename='alignment', sim_function=None, bits=None, radius=None, fingerprint=None, denominator=None, representation=None, prefilter=None, alignment_algorithm=None, query_embds=None, target_embds=None, target_df=None, needle_config=None)` ¶

Dataclass with the inputs for similarity calculation.
