Partitioning algorithms

bitbirch(df, sim_df=None, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, branching_factor=50, n_bins=10, radius=2, bits=1024, n_clusters=20, **kwargs)

Partition a dataset using the BitBirch clustering algorithm.

Generates clusters based on molecular features or similarity using the BitBirch algorithm. Labels can be optionally discretized for balancing, and the dataset is partitioned into train/test/validation subsets using smallest_assignment. Prints warnings if the partition sizes deviate from expectations.

Reference: Pérez KL, Jung V, Chen L, Huddleston K, Miranda-Quintana RA. BitBIRCH: efficient clustering of large molecular libraries. Digital Discovery. 2025;4(4):1042-51.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing molecular entities.

required
sim_df DataFrame

Optional DataFrame containing pairwise similarity scores between entities, defaults to None.

None
field_name str

Optional column name used for clustering.

None
label_name str

Optional column name for labels used for balancing, defaults to None.

None
test_size float

Proportion of entities to allocate to the test subset, defaults to 0.2.

0.2
valid_size float

Proportion of entities to allocate to the validation subset, defaults to 0.0.

0.0
threshold float

Similarity threshold used for clustering, defaults to 0.3.

0.3
verbose int

Verbosity level. Higher values print detailed partition proportions, defaults to 0.

0
branching_factor int

Branching factor for BitBirch clustering, defaults to 50.

50
n_bins int

Number of bins used when discretizing labels for balancing, defaults to 10.

10
radius int

Neighborhood radius for BitBirch clustering, defaults to 2.

2
bits int

Number of bits used in fingerprint representation, defaults to 1024.

1024
n_clusters int

Number of clusters to generate, defaults to 20.

20

Returns:

Type Description
Union[ Tuple[List[int], List[int], List[int], np.ndarray], Tuple[List[int], List[int], np.ndarray] ]

If valid_size > 0 returns train, test, valid subsets plus cluster assignments. Otherwise returns train, test subsets plus cluster assignments.
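The optional label discretization controlled by n_bins can be pictured with a small sketch. It assumes equal-width binning, which is only one possible scheme; the helper name discretize_labels is hypothetical and the binning rule actually used by the library is not stated here.

```python
def discretize_labels(labels, n_bins=10):
    """Map continuous labels to integer bin ids using equal-width bins.

    Hypothetical helper: equal-width binning is an assumption, not
    necessarily the scheme used by the documented functions.
    """
    lo, hi = min(labels), max(labels)
    if hi == lo:
        # All labels identical: everything falls in a single bin.
        return [0] * len(labels)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((y - lo) / width), n_bins - 1) for y in labels]


labels = [0.0, 0.5, 1.0, 2.5, 5.0, 9.9, 10.0]
bins = discretize_labels(labels, n_bins=10)
print(bins)
```

The resulting bin ids can then be treated as class labels when checking that each subset has a similar label distribution.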

butina(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, n_bins=10, filter_smaller=True)

Partition a dataset using a Butina-style greedy clustering algorithm.

Generates clusters based on molecular similarity using a greedy cover set approach (Butina clustering). Labels can be optionally discretized for balancing, and the dataset is partitioned into train/test/validation subsets using smallest_assignment. Prints warnings if the partition sizes deviate from expectations.

Generalized to work on similarity matrices rather than fingerprints directly.

Reference: Butina D. Unsupervised data base clustering based on Daylight's fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. Journal of Chemical Information and Computer Sciences. 1999 Jul 26;39(4):747-50.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing molecular entities.

required
sim_df DataFrame

DataFrame or matrix containing pairwise similarity scores between entities.

required
field_name str

Optional column name used for clustering. If None, clustering is based solely on sim_df.

None
label_name str

Optional column name for labels used for balancing, defaults to None.

None
test_size float

Proportion of entities to allocate to the test subset, defaults to 0.2.

0.2
valid_size float

Proportion of entities to allocate to the validation subset, defaults to 0.0.

0.0
threshold float

Similarity threshold used for cluster formation, defaults to 0.3.

0.3
verbose int

Verbosity level. Higher values print detailed partition proportions, defaults to 0.

0
n_bins int

Number of bins used when discretizing labels for balancing, defaults to 10.

10
filter_smaller Optional[bool]

Whether to filter smaller similarity values during clustering, defaults to True.

True

Returns:

Type Description
Union[ Tuple[List[int], List[int], List[int], np.ndarray], Tuple[List[int], List[int], np.ndarray] ]

If valid_size > 0 returns train, test, valid subsets plus cluster assignments. Otherwise returns train, test subsets plus cluster assignments.
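The greedy cover-set idea behind Butina clustering can be sketched on a dense similarity matrix. This is a minimal illustration, not the library's implementation: the function name butina_clusters and the list-of-lists input are assumptions, and the real function takes sim_df rather than a nested list.

```python
def butina_clusters(sim, threshold=0.3):
    """Greedy Butina-style clustering on a dense, symmetric similarity matrix.

    Hypothetical sketch. Returns a list with one cluster id per item.
    """
    n = len(sim)
    # Neighbours of i: every other item at least `threshold` similar to i.
    neighbours = [
        [j for j in range(n) if j != i and sim[i][j] >= threshold]
        for i in range(n)
    ]
    # Visit candidate centroids from most to fewest neighbours.
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assignment = [-1] * n
    next_id = 0
    for centre in order:
        if assignment[centre] != -1:
            continue  # already absorbed by an earlier cluster
        assignment[centre] = next_id
        for j in neighbours[centre]:
            if assignment[j] == -1:
                assignment[j] = next_id
        next_id += 1
    return assignment


sim = [
    [1.0, 0.8, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.8, 0.1, 0.1],
    [0.8, 0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.1, 0.7, 1.0],
]
print(butina_clusters(sim, threshold=0.5))  # [0, 0, 0, 1, 1]
```

Raising the threshold splits clusters apart: at 0.9 no pair in this toy matrix qualifies as a neighbour, so every item becomes its own cluster.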

ccpart(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, n_bins=10, filter_smaller=True)

Partitions a dataset into training, testing, and optional validation sets based on connected component clustering using a similarity matrix. Clusters are kept intact across splits, and label distributions can optionally be balanced across partitions. The smallest clusters are iteratively assigned to the test set.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing the dataset to be partitioned.

required
sim_df DataFrame

DataFrame representing precomputed pairwise similarities between samples.

required
field_name str

Name of the column in df used for clustering; if None, uses sim_df directly.

None
label_name str

Name of the label column for balancing partitions; if None, no balancing is performed.

None
test_size float

Fraction of the dataset to allocate to the test set.

0.2
valid_size float

Fraction of the dataset to allocate to the validation set; set to 0.0 to skip validation split.

0.0
threshold float

Similarity threshold for connecting components when clustering.

0.3
verbose int

Verbosity level for logging (higher values provide more detailed output).

0
n_bins int

Number of bins to discretize continuous labels into for balancing purposes.

10
filter_smaller Optional[bool]

Whether smaller values of the similarity metric indicate lower similarity.

True

Returns:

Type Description
Union[Tuple[list, list, list], Tuple[list, list, list, list]]
  • If valid_size > 0: returns (train_indices, test_indices, valid_indices, cluster_assignments)
  • Otherwise: returns (train_indices, test_indices, cluster_assignments)
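Connected component clustering on a thresholded similarity graph can be sketched with a union-find structure. The sketch below is illustrative only: the function name, the (i, j, value) edge-list input, and the reading of filter_smaller (True keeps pairs with value >= threshold, False keeps pairs with value <= threshold) are assumptions.

```python
def connected_components(n_items, edges, threshold=0.3, filter_smaller=True):
    """Cluster items by connected components of the thresholded graph.

    Hypothetical sketch. `edges` is an iterable of (i, j, value) tuples.
    """
    parent = list(range(n_items))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, value in edges:
        linked = value >= threshold if filter_smaller else value <= threshold
        if linked:
            parent[find(i)] = find(j)

    # Relabel roots as consecutive cluster ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n_items)]


edges = [(0, 1, 0.9), (1, 2, 0.4), (2, 3, 0.1), (3, 4, 0.6)]
print(connected_components(5, edges, threshold=0.3))  # [0, 0, 0, 1, 1]
```

Because items 0 and 2 are linked only through item 1, a single chain of above-threshold pairs is enough to merge them, which is why connected components tend to produce larger clusters than centroid-based methods at the same threshold.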

ccpart_random(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, seed=0, n_bins=10, filter_smaller=True)

Partitions a dataset into training, testing, and optional validation sets based on connected component clustering using a similarity matrix. Clusters are kept intact across splits, and label distributions can optionally be balanced across partitions. Clusters are assigned to the test set at random.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing the dataset to be partitioned.

required
sim_df DataFrame

DataFrame representing precomputed pairwise similarities between samples.

required
field_name str

Name of the column in df used for clustering; if None, uses sim_df directly.

None
label_name str

Name of the label column for balancing partitions; if None, no balancing is performed.

None
test_size float

Fraction of the dataset to allocate to the test set.

0.2
valid_size float

Fraction of the dataset to allocate to the validation set; set to 0.0 to skip validation split.

0.0
threshold float

Similarity threshold for connecting components when clustering.

0.3
verbose int

Verbosity level for logging (higher values provide more detailed output).

0
n_bins int

Number of bins to discretize continuous labels into for balancing purposes.

10
filter_smaller Optional[bool]

Whether smaller values of the similarity metric indicate lower similarity.

True

Returns:

Type Description
Union[Tuple[list, list, list], Tuple[list, list, list, list]]
  • If valid_size > 0: returns (train_indices, test_indices, valid_indices, cluster_assignments)
  • Otherwise: returns (train_indices, test_indices, cluster_assignments)
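Random assignment of whole clusters to the test set can be sketched as follows. This is an assumption-laden illustration: the function name random_cluster_split is hypothetical, labels and the validation split are ignored, and the rule "skip a cluster that would overshoot the quota" is one plausible choice rather than the documented behaviour.

```python
import random


def random_cluster_split(clusters, test_size=0.2, seed=0):
    """Assign whole clusters to test in random order until the quota is met.

    Hypothetical sketch. `clusters` holds one cluster id per item.
    """
    members = {}
    for idx, cluster_id in enumerate(clusters):
        members.setdefault(cluster_id, []).append(idx)

    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)  # seeded for reproducibility

    target = test_size * len(clusters)
    train, test = [], []
    for cluster_id in cluster_ids:
        # A cluster goes to test only if it fits within the remaining quota.
        if len(test) + len(members[cluster_id]) <= target:
            test.extend(members[cluster_id])
        else:
            train.extend(members[cluster_id])
    return sorted(train), sorted(test)


clusters = [0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 4]
train, test = random_cluster_split(clusters, test_size=0.25, seed=0)
```

Whatever order the shuffle produces, no cluster is ever divided between the two subsets, and repeating the call with the same seed returns the same split.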

cdhit_part(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, n_bins=10, filter_smaller=True)

Partitions a dataset into training, testing, and optional validation sets based on connected component clustering using a similarity matrix. Ensures clusters are kept intact across splits and optionally balances label distributions across partitions. Smallest clusters are iteratively assigned to testing.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing the dataset to be partitioned.

required
sim_df DataFrame

DataFrame representing precomputed pairwise similarities between samples.

required
field_name str

Name of the column in df used for clustering; if None, uses sim_df directly.

None
label_name str

Name of the label column for balancing partitions; if None, no balancing is performed.

None
test_size float

Fraction of the dataset to allocate to the test set.

0.2
valid_size float

Fraction of the dataset to allocate to the validation set; set to 0.0 to skip validation split.

0.0
threshold float

Similarity threshold for connecting components when clustering.

0.3
verbose int

Verbosity level for logging (higher values provide more detailed output).

0
n_bins int

Number of bins to discretize continuous labels into for balancing purposes.

10
filter_smaller Optional[bool]

Whether smaller values of the similarity metric indicate lower similarity.

True

Returns:

Type Description
Union[Tuple[list, list, list], Tuple[list, list, list, list]]
  • If valid_size > 0: returns (train_indices, test_indices, valid_indices, cluster_assignments)
  • Otherwise: returns (train_indices, test_indices, cluster_assignments)

graph_part(df, sim_df, label_name=None, test_size=0.0, valid_size=0.0, threshold=0.3, verbose=2, n_parts=10, filter_smaller=True)

Builds a graph from the provided similarity matrix, applies a limited agglomerative clustering algorithm, balances clusters across partitions, and performs iterative reassignment to minimize forbidden edges. The final output can optionally be split into train/test/validation subsets based on cluster proportions.

Reference: Teufel F, Gíslason MH, Almagro Armenteros JJ, Johansen AR, Winther O, Nielsen H. GraphPart: homology partitioning for biological sequence analysis. NAR genomics and bioinformatics. 2023 Dec 1;5(4):lqad088.

Code adapted and generalized from the project's GitHub repository: https://github.com/graph-part/graph-part

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing the entities to partition.

required
sim_df DataFrame

Pairwise similarity DataFrame used to build the graph.

required
label_name str

Optional column name containing entity labels used to guide cluster assignment for balancing, defaults to None.

None
test_size float

Proportion of entities to allocate to the test split, defaults to 0.0.

0.0
valid_size float

Proportion of entities to allocate to the validation split (applied only after test split assignment), defaults to 0.0.

0.0
threshold float

Similarity threshold used to define edges in the graph and guide clustering, defaults to 0.3.

0.3
verbose int

Verbosity level. Values above 1 enable progress information, defaults to 2.

2
n_parts int

Number of partitions (clusters) to generate, defaults to 10.

10
filter_smaller Optional[bool]

If True, edges with similarity >= threshold are kept. If False, edges <= threshold are kept instead, defaults to True.

True

Returns:

Type Description
Union[ np.ndarray, Tuple[List[int], List[int], np.ndarray], Tuple[List[int], List[int], List[int], np.ndarray] ]

If test_size and valid_size are both 0.0, returns an array of partition assignments (with -1 for removed nodes). Otherwise returns train/test or train/test/valid index lists along with full cluster labels.
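The quantity that the reassignment step tries to drive down can be sketched as a simple count. The sketch assumes that a forbidden edge is a pair that passes the threshold test while its two endpoints sit in different partitions; that definition, the function name, and the (i, j, value) edge-list input are assumptions made for illustration.

```python
def count_forbidden_edges(partition, edges, threshold=0.3, filter_smaller=True):
    """Count kept edges whose endpoints fall in different partitions.

    Hypothetical sketch. Items labelled -1 are treated as removed and ignored.
    """
    forbidden = 0
    for i, j, value in edges:
        kept = value >= threshold if filter_smaller else value <= threshold
        if not kept:
            continue
        if partition[i] == -1 or partition[j] == -1:
            continue
        if partition[i] != partition[j]:
            forbidden += 1
    return forbidden


edges = [(0, 1, 0.9), (1, 2, 0.5), (3, 4, 0.7), (0, 3, 0.1)]
before = count_forbidden_edges([0, 0, 1, 1, 1], edges)  # item 2 is split from item 1
after = count_forbidden_edges([0, 0, 0, 1, 1], edges)   # item 2 moved to partition 0
print(before, after)  # 1 0
```

Moving item 2 next to its similar neighbour removes the only cross-partition edge; when no such move exists, removing a node (label -1) is the remaining way to reach zero.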

random_partition(df, test_size, random_state=42, **kwargs)

Uses random partitioning to generate training and evaluation subsets. Wraps the train_test_split function from scikit-learn.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with the entities to partition

required
test_size float

Proportion of entities to allocate to the test subset. This argument has no default in the signature and must be provided.

required
random_state int

Seed for pseudo-random number generator algorithm, defaults to 42

42

Returns:

Type Description
Tuple[pd.DataFrame, pd.DataFrame]

A tuple with the indexes of training and evaluation samples.
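The idea of a seeded random split over row positions can be sketched with the standard library alone. This is not the scikit-learn implementation that the function wraps, and it will not reproduce the same shuffling or rounding; the name random_split_indices is hypothetical.

```python
import random


def random_split_indices(n_items, test_size, random_state=42):
    """Shuffle positions 0..n_items-1 and take the first share as test.

    Hypothetical stdlib sketch of a seeded random split.
    """
    indices = list(range(n_items))
    random.Random(random_state).shuffle(indices)
    n_test = round(test_size * n_items)
    # First n_test shuffled positions form the test set, the rest train.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = random_split_indices(12, test_size=0.25, random_state=42)
```

Fixing random_state makes the split reproducible across runs, which is the purpose of the default value of 42 in the documented signature.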

scaffold(df, field_name, label_name=None, test_size=0.0, valid_size=0.0, n_bins=10, verbose=1)

Partition a dataset based on Bemis-Murcko scaffolds.

Generates Bemis-Murcko scaffolds from the molecular SMILES in field_name and assigns clusters based on unique scaffolds. Optionally discretizes labels for balancing and partitions the dataset into train/test/validation subsets using smallest_assignment. Prints warnings if partition sizes deviate significantly from expectations.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing molecular data.

required
field_name str

Column name containing SMILES strings used to generate scaffolds.

required
label_name str

Optional column name containing labels used for balancing, defaults to None.

None
test_size float

Proportion of entities to allocate to the test subset, defaults to 0.0.

0.0
valid_size float

Proportion of entities to allocate to the validation subset, defaults to 0.0.

0.0
n_bins int

Number of bins used when discretizing labels for balancing, defaults to 10.

10
verbose int

Verbosity level. When > 2 prints detailed proportions, defaults to 1.

1

Returns:

Type Description
Union[ Tuple[List[int], List[int], List[int], np.ndarray], Tuple[List[int], List[int], np.ndarray] ]

If valid_size > 0 returns train, test, valid subsets plus cluster assignments. Otherwise returns train, test subsets plus cluster assignments.

sim_umap(df, sim_df, field_name=None, label_name=None, test_size=0.0, valid_size=0.0, threshold=0.3, verbose=2, n_clusters=10, n_neighbors=15, n_components=2, n_pcs=50, min_dist=0.1, boolean_out=True, n_bins=10)

UMAP-based partitioning using an external similarity matrix.

Generalizes the umap_original algorithm from Guo et al., 2025 to work on similarity matrices instead of binary fingerprints.

Generates clusters using UMAP while incorporating an external similarity matrix (sim_df). Optionally discretizes labels for balancing and partitions the dataset using smallest_assignment into train/test/validation subsets. Prints warnings if achieved proportions deviate significantly from expectations.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the entities to cluster and partition.

required
sim_df DataFrame

Similarity matrix provided as a Polars DataFrame, used to augment UMAP clustering.

required
field_name str

Optional name of a column with feature vectors used by UMAP. If None, clustering relies entirely on sim_df, defaults to None.

None
label_name str

Optional column name containing labels used for balancing partitions, defaults to None.

None
test_size float

Proportion of entities to allocate to the test subset, defaults to 0.0.

0.0
valid_size float

Proportion of entities to allocate to the validation subset, defaults to 0.0.

0.0
threshold float

Threshold used by the UMAP graph clustering step, defaults to 0.3.

0.3
verbose int

Verbosity level. When > 2 prints detailed proportions, defaults to 2.

2
n_clusters int

Desired number of clusters to generate, defaults to 10.

10
n_neighbors int

UMAP n_neighbors parameter, defaults to 15.

15
n_components int

Number of UMAP embedding dimensions, defaults to 2.

2
n_pcs int

Number of principal components to compute before UMAP, defaults to 50.

50
min_dist float

UMAP min_dist parameter controlling embedding tightness, defaults to 0.1.

0.1
boolean_out bool

Whether to convert the similarity thresholding output to boolean values, defaults to True.

True
n_bins int

Number of bins used when discretizing labels for balancing, defaults to 10.

10

Returns:

Type Description
Union[ Tuple[List[int], List[int], List[int], np.ndarray], Tuple[List[int], List[int], np.ndarray] ]

If valid_size > 0 returns train, test, valid subsets plus cluster assignments. Otherwise returns train, test subsets plus cluster assignments.

smallest_assignment(clusters, labels, size, valid_size, test_size)

Iteratively assigns the smallest clusters to the test subset until it reaches the desired size.

Parameters:

Name Type Description Default
list_ids list[str]

Ordered list of item identifiers to assign.

required
partition_lengths ndarray

Desired number of items for each partition.

required
max_length_per_partition (int, optional)

Maximum allowed size for any partition, values in partition_lengths exceeding this are clipped, defaults to 100000000.

required

Returns:

Type Description
Tuple[np.ndarray, np.ndarray, np.ndarray]

The indices for the training, testing, and validation subsets, in that order.
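The smallest-first strategy can be sketched in a few lines. The sketch is deliberately reduced: it handles only the test subset, ignores labels and the validation split, and stops at the first cluster that would overshoot the quota. The function name assign_smallest_to_test and these simplifications are assumptions, not the library's behaviour.

```python
from collections import Counter


def assign_smallest_to_test(clusters, test_size=0.2):
    """Move whole clusters into test, smallest first, until the quota is reached.

    Hypothetical sketch. `clusters` holds one cluster id per item.
    """
    sizes = Counter(clusters)
    target = test_size * len(clusters)
    test_clusters, n_test = set(), 0
    # Ties are broken by cluster id so the result is deterministic.
    for cluster_id, size in sorted(sizes.items(), key=lambda kv: (kv[1], kv[0])):
        if n_test + size > target:
            break
        test_clusters.add(cluster_id)
        n_test += size
    train = [i for i, c in enumerate(clusters) if c not in test_clusters]
    test = [i for i, c in enumerate(clusters) if c in test_clusters]
    return train, test


clusters = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 3, 3]
train, test = assign_smallest_to_test(clusters, test_size=0.25)
print(test)  # [6, 7, 8]
```

Sending the smallest clusters to the test set keeps the large, densely populated clusters in training, so the test set is made of items from sparsely sampled regions of the similarity space.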

umap_original(df, field_name, label_name=None, test_size=0.0, valid_size=0.0, threshold=0.3, verbose=2, n_clusters=10, n_neighbors=15, n_components=2, n_pcs=50, min_dist=0.1, radius=2, bits=1024, n_bins=10, **kwargs)

Computes UMAP embeddings using the specified feature column, generates cluster assignments, discretizes labels (if provided), and then distributes the instances into train/test/validation partitions using the smallest_assignment strategy. Optional warnings are printed if resulting partitions deviate significantly from expected proportions.

Reference: Guo Q, Hernandez-Hernandez S, Ballester PJ. UMAP-based clustering split for rigorous evaluation of AI models for virtual screening on cancer cell lines. Journal of Cheminformatics. 2025 Jun 10;17(1):94.

Code adapted from Pat Walters' useful_rdkit_utils GitHub repository: https://github.com/PatWalters/useful_rdkit_utils

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing entities to cluster and partition.

required
field_name str

Name of the column containing the features used by UMAP.

required
label_name str

Optional column name with labels used for balancing partitions, defaults to None.

None
test_size float

Proportion of entities to place in the test subset, defaults to 0.0.

0.0
valid_size float

Proportion of entities to place in the validation subset, defaults to 0.0.

0.0
threshold float

Threshold applied during UMAP-based graph clustering, defaults to 0.3.

0.3
verbose int

Verbosity level. Values > 2 print partition proportions, defaults to 2.

2
n_clusters int

Desired number of clusters to generate using UMAP, defaults to 10.

10
n_neighbors int

UMAP n_neighbors parameter, defaults to 15.

15
n_components int

Number of UMAP embedding dimensions, defaults to 2.

2
n_pcs int

Number of principal components to compute before UMAP, defaults to 50.

50
min_dist float

UMAP min_dist parameter controlling embedding tightness, defaults to 0.1.

0.1
radius int

Radius value used by the UMAP graph construction, defaults to 2.

2
bits int

Dimensionality of any hashing step used for vector representations, defaults to 1024.

1024
n_bins int

Number of bins used when discretizing labels for balancing, defaults to 10.

10
kwargs dict

Additional keyword arguments passed to underlying UMAP or clustering routines.

{}

Returns:

Type Description
Union[ Tuple[List[int], List[int], List[int], np.ndarray], Tuple[List[int], List[int], np.ndarray] ]

If valid_size > 0 returns train, test, valid partitions plus cluster assignments. Otherwise returns train, test partitions plus cluster assignments.