Dataset Generator

HestiaGenerator(data, verbose=True)

Class for generating multiple Dataset partitions for generalisation evaluation.

Initialise class

Parameters:

Name Type Description Default
data DataFrame

DataFrame with the original data from which datasets will be generated.

required

calculate_augood(results, target_df, target_field_name, target_embds=None, return_weights=False)

Calculate the 'area under the GOOD curve' (AU-GOOD) metric.

This function calculates an AU-GOOD score by weighting the metric obtained at each similarity threshold by how often that level of similarity occurs when the target deployment distribution is compared to the training distribution. It returns both the weighted GOOD curve values and the AU-GOOD score.

Parameters:

Name Type Description Default
results Dict[float, float]

A dictionary where keys are bins or thresholds (float) and values are metrics or counts associated with each bin.

required
target_df DataFrame

A DataFrame containing the target data for similarity comparison. The column specified by target_field_name will be used to populate the similarity arguments for comparison.

required
target_field_name Optional[str]

Name of the field in target_df that contains target values for comparison.

required
target_embds Optional[ndarray]

A NumPy array containing the target embeddings for similarity calculation.

None
return_weights bool

Return histogram values for train-deployment similarities

False

Returns:

Type Description
Union[Tuple[np.ndarray, float], Tuple[np.ndarray, float, np.ndarray]]

A tuple containing:
- good_curve (np.ndarray): Array of weighted values representing the GOOD curve.
- au_good (float): The calculated area under the GOOD curve.
and optionally:
- weights (np.ndarray): Array of weights representing train-deployment similarities.
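The weighting described above can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: the function name `augood_sketch`, the `max_sims` input (one maximum train similarity per target item), and the rule that assigns each similarity to the highest threshold not exceeding it are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


def augood_sketch(
    results: Dict[float, float], max_sims: List[float]
) -> Tuple[List[float], float, List[float]]:
    """Weight per-threshold metrics by a histogram of target similarities.

    Hypothetical helper: the binning rule below is an assumption,
    not necessarily what calculate_augood does internally.
    """
    if not max_sims:
        raise ValueError("max_sims must not be empty")
    thresholds = sorted(results)
    counts = [0] * len(thresholds)
    for sim in max_sims:
        # Bin index of the highest threshold <= sim (clamped to the first bin).
        idx = max(bisect_right(thresholds, sim) - 1, 0)
        counts[idx] += 1
    # Normalise the histogram so the weights sum to 1.
    weights = [count / len(max_sims) for count in counts]
    # Weighted GOOD curve: one value per threshold.
    good_curve = [w * results[t] for w, t in zip(weights, thresholds)]
    return good_curve, sum(good_curve), weights


# Model performance measured at two thresholds (illustrative numbers).
results = {0.0: 0.60, 0.5: 0.80}
# Maximum similarity of each target item to the training data (illustrative).
max_sims = [0.1, 0.2, 0.7, 0.9]
good_curve, au_good, weights = augood_sketch(results, max_sims)
```

In this toy case half of the target items fall in each bin, so the score is the plain average of the two metrics. A deployment set that is more dissimilar to the training data would shift the weight towards the low-threshold metric.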

calculate_partitions(sim_args=None, sim_df=None, label_name=None, min_threshold=0.0, threshold_step=0.05, test_size=0.2, valid_size=0.1, partition_algorithm='ccpart', random_state=42, verbose=1, n_partitions=None)

Calculates multiple partitions of a dataset for training, validation, and testing based on sequence similarity. The partitioning algorithm is selected with partition_algorithm (for example ccpart or graph_part; see the parameter description for all options). Partitions are computed for a range of similarity thresholds, and random partitions are computed as well.

Example:

Partitioning from a minimum similarity threshold of 0.2, in steps of 0.05, with a test size of 0.2:

    partitions = calculate_partitions(
        sim_args=similarity_args,
        label_name='Y',
        min_threshold=0.2,
        threshold_step=0.05,
        test_size=0.2,
        partition_algorithm='ccpart',
        random_state=42
    )

Accessing the partitions for a specific threshold (here 0.3):

    train_set = partitions[0.3]['train']
    valid_set = partitions[0.3]['valid']
    test_set = partitions[0.3]['test']
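Because the result is indexed by threshold value, it helps to know which thresholds min_threshold and threshold_step produce. The following is a rough sketch only: `threshold_grid`, its `max_threshold` upper bound (treated as exclusive), and the rounding step are assumptions, not the library's documented behaviour.

```python
from typing import List


def threshold_grid(
    min_threshold: float = 0.0,
    threshold_step: float = 0.05,
    max_threshold: float = 1.0,  # hypothetical upper bound, exclusive
    ndigits: int = 4,
) -> List[float]:
    """Enumerate thresholds from min_threshold in steps of threshold_step."""
    n_steps = int(round((max_threshold - min_threshold) / threshold_step))
    # Multiply instead of accumulating, then round, so that values such as
    # 0.2 + 2 * 0.05 become exactly 0.3 rather than 0.30000000000000004.
    return [round(min_threshold + i * threshold_step, ndigits) for i in range(n_steps)]


grid = threshold_grid(min_threshold=0.2, threshold_step=0.05)
```

The rounding matters when thresholds are used as dictionary keys: without it, a lookup such as `partitions[0.3]` can miss a key that was built by repeated floating-point addition.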

Parameters:

Name Type Description Default
sim_args Optional[SimArguments]

Object containing the similarity parameters for partitioning. This includes options for calculating sequence similarity, such as the alignment method and similarity threshold. Defaults to None.

None
sim_df Optional[DataFrame]

Precomputed similarity DataFrame. If None, the similarity will be calculated using sim_args.

None
label_name Optional[str]

The name of the label column for the dataset. Defaults to None.

None
min_threshold Optional[float]

The minimum similarity threshold to start partitioning. Defaults to 0.0.

0.0
threshold_step Optional[float]

The step size for varying the similarity threshold during partitioning. Defaults to 0.05.

0.05
test_size Optional[float]

The proportion of the dataset to allocate to the test set. Defaults to 0.2.

0.2
valid_size Optional[float]

The proportion of the training set to allocate to the validation set. Defaults to 0.1.

0.1
verbose int

Verbosity level for process logging, where higher values increase output detail.

1
partition_algorithm Optional[str]

The partitioning algorithm to use. Options are:
- 'ccpart': Connected components algorithm that places the smallest unconnected clusters in the test set.
- 'graph_part': GraphPart partitioning.
- 'butina': Butina split.
- Connected components algorithm that places random clusters in the test set.
Defaults to 'ccpart'.

'ccpart'
random_state Optional[int]

The random seed for reproducibility. Defaults to 42.

42
n_partitions Optional[int]

The number of partitions to create when using graph_part. Defaults to None.

None

Returns:

Type Description
dict

A dictionary containing the partitions for each threshold. The entry for each threshold has keys:
- train: DataFrame for the training set.
- valid: DataFrame for the validation set.
- test: DataFrame for the test set.
- clusters: The clusters formed by the partitioning algorithm.
For random partitions, the key 'random' will contain the train, valid, and test sets.

Raises:

Type Description
ValueError

If an unsupported partition algorithm is specified.
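To make the 'ccpart' description concrete, here is a simplified sketch of a connected-components split for a single threshold. It is an illustration under stated assumptions, not the library's code: `connected_components` and `ccpart_sketch` are hypothetical names, similarities are passed as `(i, j, score)` triples, two items are linked when their score is strictly above the threshold, and the validation split is omitted.

```python
from typing import Dict, List, Tuple


def connected_components(n_items: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Group item indexes into connected components with union-find."""
    parent = list(range(n_items))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: Dict[int, List[int]] = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def ccpart_sketch(
    n_items: int,
    sims: List[Tuple[int, int, float]],
    threshold: float,
    test_size: float = 0.2,
) -> Tuple[List[int], List[int]]:
    """Fill the test set with the smallest clusters first; the rest is training."""
    # Assumption: items are linked when similarity is strictly above the threshold.
    edges = [(a, b) for a, b, score in sims if score > threshold]
    clusters = sorted(connected_components(n_items, edges), key=len)
    target = round(test_size * n_items)
    train: List[int] = []
    test: List[int] = []
    for cluster in clusters:
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)


# Ten items; only pairs scoring above 0.3 end up in the same cluster.
sims = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.7), (5, 6, 0.2)]
train_idx, test_idx = ccpart_sketch(10, sims, threshold=0.3, test_size=0.2)
```

Since whole clusters are assigned to one side, no test item is linked to a training item above the threshold, which is the property that makes the split useful for generalisation evaluation.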

calculate_similarity(sim_args)

Calculate pairwise similarity between all the elements in the dataset.

Parameters:

Name Type Description Default
sim_args SimArguments

See similarity arguments entry.

required
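As a rough picture of what an all-against-all similarity calculation produces, the sketch below scores every unordered pair of sequences. It uses `difflib.SequenceMatcher` from the standard library purely as a stand-in metric; the function name `pairwise_similarity`, the row keys `query`, `target`, and `metric`, and the way `min_threshold` filters rows are assumptions for illustration rather than the library's actual output format.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Union

Row = Dict[str, Union[int, float]]


def pairwise_similarity(items: List[str], min_threshold: float = 0.0) -> List[Row]:
    """Score each unordered pair once and keep pairs at or above min_threshold."""
    rows: List[Row] = []
    for (i, a), (j, b) in combinations(enumerate(items), 2):
        # Stand-in metric: ratio of matching characters, in [0, 1].
        score = SequenceMatcher(None, a, b).ratio()
        if score >= min_threshold:
            rows.append({"query": i, "target": j, "metric": score})
    return rows


sequences = ["ACDE", "ACDF", "WWWW"]
all_rows = pairwise_similarity(sequences)
similar_rows = pairwise_similarity(sequences, min_threshold=0.5)
```

The number of pairs grows quadratically with the dataset size, which is why a precomputed result is worth keeping and reloading.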

compare_models(model_results, statistical_test='wilcoxon') staticmethod

Compare the generalisation capabilities of n models against each other, providing a p-value for every possible pair of models that measures how likely model A is to outperform model B.

Parameters:

Name Type Description Default
model_results Dict[str, Union[List[float], ndarray]]

Dictionary with model name as key and a list with the ordered performance values of the model at different thresholds.

required
statistical_test str

Statistical test to compute the model differences. Currently supported:
- wilcoxon: Wilcoxon ranked-sum test
Defaults to 'wilcoxon'

'wilcoxon'
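The pairwise structure of the comparison can be sketched without any dependencies. Note the substitution: the code below uses a one-sided sign test as a simple stand-in, not the Wilcoxon test named above, and `sign_test_greater` and `compare_models_sketch` are hypothetical helpers. The shape of the returned dictionary is likewise an assumption, since the return value of compare_models is not described here.

```python
from itertools import permutations
from math import comb
from typing import Dict, List, Tuple


def sign_test_greater(a: List[float], b: List[float]) -> float:
    """One-sided sign test p-value for 'a tends to be greater than b'."""
    wins = sum(x > y for x, y in zip(a, b))
    n = sum(x != y for x, y in zip(a, b))  # ties are discarded
    if n == 0:
        return 1.0
    # P(at least `wins` successes out of n fair coin flips).
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def compare_models_sketch(
    model_results: Dict[str, List[float]]
) -> Dict[Tuple[str, str], float]:
    """Return a p-value for every ordered pair (model A, model B)."""
    return {
        (a, b): sign_test_greater(model_results[a], model_results[b])
        for a, b in permutations(model_results, 2)
    }


# Performance of each model at the same ordered thresholds (illustrative).
model_results = {
    "model_a": [0.90, 0.85, 0.80, 0.75],
    "model_b": [0.80, 0.70, 0.60, 0.50],
}
p_values = compare_models_sketch(model_results)
```

Each list must be ordered by the same thresholds, because the comparison pairs the values position by position.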

from_precalculated(data_path)

Load partition indexes if they have already been calculated.

Parameters:

Name Type Description Default
data_path str

Path to saved partition indexes.

required

load_similarity(output_path)

Load similarity calculation from file.

Parameters:

Name Type Description Default
output_path str

File with similarity calculations.

required

save_precalculated(output_path, include_metada=True)

Save partition indexes to disk for quicker re-running.

Parameters:

Name Type Description Default
output_path str

Path where partition indexes should be saved.

required
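The idea of saving partition indexes once and reloading them later can be sketched as a plain JSON round trip. This is not the library's on-disk format: `save_indexes`, `load_indexes`, and the file layout are hypothetical, and the sketch only shows the general pattern, including how threshold keys survive serialisation.

```python
import json
import os
import tempfile
from typing import Dict, List, Union

Key = Union[float, str]
Partitions = Dict[Key, Dict[str, List[int]]]


def save_indexes(partitions: Partitions, path: str) -> None:
    """Write partition indexes to a JSON file (JSON keys must be strings)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({str(key): part for key, part in partitions.items()}, handle)


def _decode_key(key: str) -> Key:
    """Turn '0.3' back into 0.3 and leave non-numeric keys such as 'random'."""
    try:
        return float(key)
    except ValueError:
        return key


def load_indexes(path: str) -> Partitions:
    """Read partition indexes back, restoring float threshold keys."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)
    return {_decode_key(key): part for key, part in raw.items()}


partitions = {
    0.3: {"train": [0, 1, 2], "valid": [3], "test": [4]},
    "random": {"train": [1, 2, 4], "valid": [0], "test": [3]},
}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "partitions.json")
    save_indexes(partitions, path)
    restored = load_indexes(path)
```

Storing indexes rather than full DataFrames keeps the file small, and the original data can be re-sliced with them after loading.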

SimArguments(data_type='protein', field_name='sequence', min_threshold=0.0, threads=cpu_count(), verbose=0, save_alignment=False, filename='alignment', sim_function=None, bits=None, radius=None, fingerprint=None, denominator=None, representation=None, prefilter=None, alignment_algorithm=None, query_embds=None, target_embds=None, target_df=None, needle_config=None, **kwargs)

Dataclass with the inputs for similarity calculation.