Partitioning algorithms
ccpart(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, n_bins=10, filter_smaller=True)

Partitions a dataset into training, testing, and optional validation sets based on connected-component clustering over a similarity matrix. Clusters are kept intact across splits, and label distributions can optionally be balanced across partitions. The smallest clusters are assigned to the test set first.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`df` | DataFrame | DataFrame containing the dataset to be partitioned. | required |
`sim_df` | DataFrame | DataFrame representing precomputed pairwise similarities between samples. | required |
`field_name` | str | Name of the column in `df`. | `None` |
`label_name` | str | Name of the label column for balancing partitions; if None, no balancing is performed. | `None` |
`test_size` | float | Fraction of the dataset to allocate to the test set. | `0.2` |
`valid_size` | float | Fraction of the dataset to allocate to the validation set; set to 0.0 to skip the validation split. | `0.0` |
`threshold` | float | Similarity threshold for connecting components when clustering. | `0.3` |
`verbose` | int | Verbosity level for logging (higher values provide more detailed output). | `0` |
`n_bins` | int | Number of bins into which continuous labels are discretized for balancing. | `10` |
`filter_smaller` | Optional[bool] | Whether smaller values of the similarity metric indicate less similar pairs. | `True` |

Returns:

Type | Description |
---|---|
Union[Tuple[list, list, list], Tuple[list, list, list, list]] | |
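The cluster-then-assign behavior described above can be illustrated with a minimal, self-contained sketch: link every pair of samples whose similarity meets the threshold, extract connected components with a union-find, and fill the test split from the smallest clusters upward. This is an illustration of the documented behavior, not the library's implementation; `_components` and `ccpart_sketch` are hypothetical names.

```python
def _components(sim, threshold):
    """Connected components over a dense similarity matrix, via union-find."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Link samples whose pairwise similarity reaches the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def ccpart_sketch(sim, threshold=0.3, test_size=0.2):
    clusters = _components(sim, threshold)
    # Smallest clusters go to the test set first, so test samples stay
    # maximally dissimilar from the training data.
    clusters.sort(key=len)
    target = test_size * len(sim)
    test = []
    for cluster in clusters:
        if len(test) + len(cluster) > target:
            break
        test.extend(cluster)
    test_set = set(test)
    train = [i for i in range(len(sim)) if i not in test_set]
    return train, test
```

With a toy 5-sample matrix where samples 0/1 and 2/3 are mutually similar and sample 4 is a singleton, the singleton cluster fills a 20% test split and the two pairs stay in training.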
ccpart_random(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, seed=0, n_bins=10, filter_smaller=True)

Partitions a dataset into training, testing, and optional validation sets based on connected-component clustering over a similarity matrix. Clusters are kept intact across splits, and label distributions can optionally be balanced across partitions. Clusters are assigned to the test set in random order.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`df` | DataFrame | DataFrame containing the dataset to be partitioned. | required |
`sim_df` | DataFrame | DataFrame representing precomputed pairwise similarities between samples. | required |
`field_name` | str | Name of the column in `df`. | `None` |
`label_name` | str | Name of the label column for balancing partitions; if None, no balancing is performed. | `None` |
`test_size` | float | Fraction of the dataset to allocate to the test set. | `0.2` |
`valid_size` | float | Fraction of the dataset to allocate to the validation set; set to 0.0 to skip the validation split. | `0.0` |
`threshold` | float | Similarity threshold for connecting components when clustering. | `0.3` |
`verbose` | int | Verbosity level for logging (higher values provide more detailed output). | `0` |
`n_bins` | int | Number of bins into which continuous labels are discretized for balancing. | `10` |
`filter_smaller` | Optional[bool] | Whether smaller values of the similarity metric indicate less similar pairs. | `True` |

Returns:

Type | Description |
---|---|
Union[Tuple[list, list, list], Tuple[list, list, list, list]] | |
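ccpart_random differs from ccpart only in the assignment step: clusters are drawn in a seed-controlled random order rather than smallest-first. A hedged sketch of just that step follows, assuming the cluster list comes from the same connected-component stage as above; `random_cluster_split` is a hypothetical name, and whether oversized clusters are skipped or end the loop is an assumption of this sketch.

```python
import random


def random_cluster_split(clusters, n_total, test_size=0.2, seed=0):
    """Assign whole clusters to the test set in seeded random order."""
    order = list(range(len(clusters)))
    random.Random(seed).shuffle(order)  # reproducible cluster order
    target = test_size * n_total
    test = []
    for k in order:
        if len(test) + len(clusters[k]) > target:
            continue  # assumption: skip clusters that overshoot the budget
        test.extend(clusters[k])
    test_set = set(test)
    train = [i for i in range(n_total) if i not in test_set]
    return train, test
```

Because clusters are never split, any seed produces a test set made of whole clusters; with clusters `[[0, 1], [2, 3], [4]]` and a 20% budget over 5 samples, only the singleton cluster can fit.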
random_partition(df, test_size, random_state=42, **kwargs)

Uses a random partitioning algorithm to generate training and evaluation subsets. Wrapper around the train_test_split function from scikit-learn.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`df` | DataFrame | DataFrame with the entities to partition. | required |
`test_size` | float | Proportion of entities to be allocated to the test subset. | required |
`random_state` | int | Seed for the pseudo-random number generator. | `42` |

Returns:

Type | Description |
---|---|
Tuple[pd.DataFrame, pd.DataFrame] | A tuple with the indexes of training and evaluation samples. |
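Since this is documented as a thin wrapper around scikit-learn's train_test_split, its behavior can be sketched without the sklearn dependency using pandas sampling; `random_partition_sketch` is an illustrative stand-in, not the library function.

```python
import pandas as pd


def random_partition_sketch(df, test_size, random_state=42):
    # Randomly sample the test fraction; everything else is training.
    test = df.sample(frac=test_size, random_state=random_state)
    train = df.drop(test.index)
    return train, test


df = pd.DataFrame({"sequence": ["AAA", "CCC", "GGG", "TTT", "ACG"]})
train, test = random_partition_sketch(df, test_size=0.2)
```

Because the split ignores similarity entirely, near-duplicate entities can land on both sides; the ccpart variants above exist precisely to prevent that leakage.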