Partitioning algorithms

`ccpart(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, n_bins=10, filter_smaller=True)`

Partitions a dataset into training, testing, and optional validation sets based on connected-component clustering of a similarity matrix. Each cluster is kept intact within a single split, and label distributions can optionally be balanced across partitions. The smallest clusters are assigned to the test set first, iteratively.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame containing the dataset to be partitioned.	required
`sim_df`	`DataFrame`	DataFrame representing precomputed pairwise similarities between samples.	required
`field_name`	`str`	Name of the column in `df` used for clustering; if None, uses `sim_df` directly.	`None`
`label_name`	`str`	Name of the label column for balancing partitions; if None, no balancing is performed.	`None`
`test_size`	`float`	Fraction of the dataset to allocate to the test set.	`0.2`
`valid_size`	`float`	Fraction of the dataset to allocate to the validation set; set to 0.0 to skip validation split.	`0.0`
`threshold`	`float`	Similarity threshold for connecting components when clustering.	`0.3`
`verbose`	`int`	Verbosity level for logging (higher values provide more detailed output).	`0`
`n_bins`	`int`	Number of bins to discretize continuous labels into for balancing purposes.	`10`
`filter_smaller`	`Optional[bool]`	Whether smaller values of the similarity metric indicate less similar samples.	`True`

Returns:

Type	Description
`Union[Tuple[list, list, list], Tuple[list, list, list, list]]`	If `valid_size > 0`: `(train_indices, test_indices, valid_indices, cluster_assignments)`. Otherwise: `(train_indices, test_indices, cluster_assignments)`.
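
The description above can be illustrated with a minimal, standard-library sketch of the general technique: threshold the pairwise similarities, group samples into connected components, and fill the test set with the smallest components first. This is not the library's implementation. The helper names (`connected_components`, `smallest_first_split`), the `(i, j, similarity)` input format, and the assumption that larger values mean more similar are all hypothetical, and label balancing and the validation split are left out.

```python
from collections import defaultdict


def connected_components(n, pairs):
    """Label each of n nodes with a representative of its component (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    return [find(i) for i in range(n)]


def smallest_first_split(n, sim_pairs, threshold=0.3, test_size=0.2):
    """Hypothetical helper: sim_pairs is a list of (i, j, similarity) tuples."""
    # Assumption: two samples are connected when similarity exceeds the threshold.
    edges = [(i, j) for i, j, s in sim_pairs if s > threshold]
    labels = connected_components(n, edges)

    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)

    # Visit clusters from smallest to largest; a cluster goes to test only
    # if it fits entirely within the remaining test budget.
    target = test_size * n
    train, test = [], []
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test), labels


sim_pairs = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.5), (5, 6, 0.1)]
train_idx, test_idx, cluster_labels = smallest_first_split(10, sim_pairs)
```

Because whole clusters are moved at once, the test set can end up smaller than `test_size` when no remaining cluster fits the budget.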

`ccpart_random(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, seed=0, n_bins=10, filter_smaller=True)`

Partitions a dataset into training, testing, and optional validation sets based on connected-component clustering of a similarity matrix. Each cluster is kept intact within a single split, and label distributions can optionally be balanced across partitions. Clusters are assigned to the test set at random.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame containing the dataset to be partitioned.	required
`sim_df`	`DataFrame`	DataFrame representing precomputed pairwise similarities between samples.	required
`field_name`	`str`	Name of the column in `df` used for clustering; if None, uses `sim_df` directly.	`None`
`label_name`	`str`	Name of the label column for balancing partitions; if None, no balancing is performed.	`None`
`test_size`	`float`	Fraction of the dataset to allocate to the test set.	`0.2`
`valid_size`	`float`	Fraction of the dataset to allocate to the validation set; set to 0.0 to skip validation split.	`0.0`
`threshold`	`float`	Similarity threshold for connecting components when clustering.	`0.3`
`verbose`	`int`	Verbosity level for logging (higher values provide more detailed output).	`0`
`seed`	`int`	Seed for the pseudo-random number generator used when assigning clusters.	`0`
`n_bins`	`int`	Number of bins to discretize continuous labels into for balancing purposes.	`10`
`filter_smaller`	`Optional[bool]`	Whether smaller values of the similarity metric indicate less similar samples.	`True`

Returns:

Type	Description
`Union[Tuple[list, list, list], Tuple[list, list, list, list]]`	If `valid_size > 0`: `(train_indices, test_indices, valid_indices, cluster_assignments)`. Otherwise: `(train_indices, test_indices, cluster_assignments)`.
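
To make the random variant concrete, here is a small standard-library sketch of assigning whole clusters to the test set in a seeded random order. It assumes cluster labels have already been computed; `random_cluster_split` is a hypothetical helper, not the library's code, and it ignores label balancing and the validation split.

```python
import random
from collections import defaultdict


def random_cluster_split(cluster_labels, test_size=0.2, seed=0):
    """Hypothetical helper: cluster_labels[i] is the cluster id of sample i."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(cluster_labels):
        clusters[lab].append(idx)

    # Shuffle the clusters (not the samples) so each one stays intact.
    groups = list(clusters.values())
    random.Random(seed).shuffle(groups)

    target = test_size * len(cluster_labels)
    train, test = [], []
    for members in groups:
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


labels = [0, 0, 0, 3, 3, 5, 6, 7, 8, 9]
train_idx, test_idx = random_cluster_split(labels, test_size=0.2, seed=0)
```

Using a private `random.Random(seed)` instance keeps the split reproducible without touching the global random state.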

`random_partition(df, test_size, random_state=42, **kwargs)`

Randomly partitions the dataset into training and evaluation subsets. Wrapper around the `train_test_split` function from scikit-learn.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame with the entities to partition.	required
`test_size`	`float`	Proportion of entities to be allocated to the test subset.	required
`random_state`	`int`	Seed for the pseudo-random number generator.	`42`

Returns:

Type	Description
`Tuple[pd.DataFrame, pd.DataFrame]`	A tuple with the indexes of training and evaluation samples.
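
Since the function is described as a wrapper around scikit-learn's `train_test_split`, the underlying call can be sketched directly. The snippet below calls scikit-learn itself, not `random_partition`; splitting a plain list of row identifiers is an assumption made for illustration (with a DataFrame one might pass `df.index` instead), and how the wrapper forwards `**kwargs` is not stated in the source.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the row identifiers of a 10-row dataset (illustrative only).
row_ids = list(range(10))

# A fixed random_state makes the shuffle, and therefore the split, reproducible.
train_ids, test_ids = train_test_split(row_ids, test_size=0.2, random_state=42)
```

With 10 rows and `test_size=0.2`, two identifiers go to the evaluation subset and eight to training.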
