Partitioning algorithms

ccpart(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, n_bins=10, filter_smaller=True)

Partitions a dataset into training, testing, and optional validation sets based on connected-component clustering over a similarity matrix. Clusters are kept intact across splits, and label distributions can optionally be balanced across partitions. The smallest clusters are iteratively assigned to the test set.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | DataFrame | DataFrame containing the dataset to be partitioned. | required |
| `sim_df` | DataFrame | DataFrame with precomputed pairwise similarities between samples. | required |
| `field_name` | str | Name of the column in `df` used for clustering; if None, `sim_df` is used directly. | None |
| `label_name` | str | Name of the label column used to balance partitions; if None, no balancing is performed. | None |
| `test_size` | float | Fraction of the dataset to allocate to the test set. | 0.2 |
| `valid_size` | float | Fraction of the dataset to allocate to the validation set; set to 0.0 to skip the validation split. | 0.0 |
| `threshold` | float | Similarity threshold for connecting components during clustering. | 0.3 |
| `verbose` | int | Verbosity level for logging; higher values produce more detailed output. | 0 |
| `n_bins` | int | Number of bins into which continuous labels are discretized for balancing. | 10 |
| `filter_smaller` | Optional[bool] | Whether smaller metric values indicate less similar samples; set to False for distance-like metrics, where smaller values mean more similar. | True |

Returns:

| Type | Description |
| --- | --- |
| Union[Tuple[list, list, list], Tuple[list, list, list, list]] | If `valid_size > 0`: `(train_indices, test_indices, valid_indices, cluster_assignments)`; otherwise: `(train_indices, test_indices, cluster_assignments)`. |
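
A minimal usage sketch. The import path (`partitioning`), the toy data, and the square-matrix layout of `sim_df` are illustrative assumptions; adjust them to your installation and to the similarity format the library actually expects.

```python
import numpy as np
import pandas as pd

# Hypothetical import path for illustration; adjust it to wherever
# ccpart is defined in your installation of the library.
from partitioning import ccpart

# Toy dataset: six samples with a continuous label.
df = pd.DataFrame({
    "sequence": ["AAAA", "AAAT", "CCCC", "CCCG", "GGGG", "GGGT"],
    "activity": [0.10, 0.15, 0.80, 0.75, 0.50, 0.55],
})

# Assumed layout: a square pairwise-similarity matrix (1.0 = identical).
# Pairs (0, 1), (2, 3), and (4, 5) exceed the 0.3 threshold, so they
# form three connected components that must stay within one partition.
sim_df = pd.DataFrame(np.array([
    [1.0, 0.9, 0.1, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.1, 0.1, 0.7, 1.0],
]))

# valid_size defaults to 0.0, so three values are returned.
train_idx, test_idx, clusters = ccpart(
    df, sim_df, test_size=0.2, threshold=0.3
)
print(len(train_idx), len(test_idx))
```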

ccpart_random(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, seed=0, n_bins=10, filter_smaller=True)

Partitions a dataset into training, testing, and optional validation sets based on connected-component clustering over a similarity matrix. Clusters are kept intact across splits, and label distributions can optionally be balanced across partitions. Clusters are assigned to the test set at random.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | DataFrame | DataFrame containing the dataset to be partitioned. | required |
| `sim_df` | DataFrame | DataFrame with precomputed pairwise similarities between samples. | required |
| `field_name` | str | Name of the column in `df` used for clustering; if None, `sim_df` is used directly. | None |
| `label_name` | str | Name of the label column used to balance partitions; if None, no balancing is performed. | None |
| `test_size` | float | Fraction of the dataset to allocate to the test set. | 0.2 |
| `valid_size` | float | Fraction of the dataset to allocate to the validation set; set to 0.0 to skip the validation split. | 0.0 |
| `threshold` | float | Similarity threshold for connecting components during clustering. | 0.3 |
| `verbose` | int | Verbosity level for logging; higher values produce more detailed output. | 0 |
| `seed` | int | Seed for the pseudo-random assignment of clusters to partitions. | 0 |
| `n_bins` | int | Number of bins into which continuous labels are discretized for balancing. | 10 |
| `filter_smaller` | Optional[bool] | Whether smaller metric values indicate less similar samples; set to False for distance-like metrics, where smaller values mean more similar. | True |

Returns:

| Type | Description |
| --- | --- |
| Union[Tuple[list, list, list], Tuple[list, list, list, list]] | If `valid_size > 0`: `(train_indices, test_indices, valid_indices, cluster_assignments)`; otherwise: `(train_indices, test_indices, cluster_assignments)`. |
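
A sketch of the randomized variant under the same assumptions, reusing `df` and `sim_df` from the `ccpart` example above. With `valid_size > 0`, a fourth value is returned.

```python
# Reuses df and sim_df from the ccpart sketch above; the import path
# is the same illustrative assumption.
from partitioning import ccpart_random

# Request both a test and a validation split; clusters are assigned to
# the test set at random, reproducibly via `seed`. Because
# valid_size > 0, four values are returned.
train_idx, test_idx, valid_idx, clusters = ccpart_random(
    df, sim_df,
    test_size=0.2,
    valid_size=0.1,
    threshold=0.3,
    seed=0,
)
print(len(train_idx), len(test_idx), len(valid_idx))
```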

random_partition(df, test_size, random_state=42, **kwargs)

Uses a random partitioning algorithm to generate training and evaluation subsets. A thin wrapper around scikit-learn's `train_test_split` function.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | DataFrame | DataFrame with the entities to partition. | required |
| `test_size` | float | Proportion of entities to allocate to the test subset. | required |
| `random_state` | int | Seed for the pseudo-random number generator. | 42 |

Returns:

| Type | Description |
| --- | --- |
| Tuple[pd.DataFrame, pd.DataFrame] | A tuple with the indexes of training and evaluation samples. |
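
A minimal sketch, again with an assumed import path. Given the signature, extra keyword arguments presumably pass through to scikit-learn's `train_test_split`.

```python
import pandas as pd

# Hypothetical import path for illustration, as above.
from partitioning import random_partition

df = pd.DataFrame({"sequence": ["AAAA", "CCCC", "GGGG", "TTTT", "ACGT"]})

# 80/20 random split, reproducible through random_state; any extra
# keyword arguments are forwarded to scikit-learn's train_test_split.
train, test = random_partition(df, test_size=0.2, random_state=42)
print(len(train), len(test))
```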