Partitioning algorithms
bitbirch(df, sim_df=None, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, branching_factor=50, n_bins=10, radius=2, bits=1024, n_clusters=20, **kwargs)
Partition a dataset using the BitBirch clustering algorithm.
Generates clusters based on molecular features or similarity using
the BitBirch algorithm. Labels can be optionally discretized for
balancing, and the dataset is partitioned into train/test/validation
subsets using smallest_assignment. Prints warnings if the partition
sizes deviate from expectations.
Reference: Pérez KL, Jung V, Chen L, Huddleston K, Miranda-Quintana RA. BitBIRCH: efficient clustering of large molecular libraries. Digital Discovery. 2025;4(4):1042-51.
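The label discretization mentioned above can be pictured with a minimal sketch. It assumes equal-width bins, which the description does not confirm (a quantile-based scheme would be just as plausible), and the helper name `discretize_labels` is hypothetical rather than part of the library.

```python
from typing import List, Sequence


def discretize_labels(values: Sequence[float], n_bins: int = 10) -> List[int]:
    """Map continuous labels to equal-width bin indices in [0, n_bins - 1].

    Assumption: equal-width binning; the library may bin differently.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All labels are identical, so everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


labels = [0.0, 0.5, 1.0, 2.5, 5.0, 9.9, 10.0]
bins = discretize_labels(labels, n_bins=10)
```

Once continuous labels are reduced to a handful of bins, balancing a partition amounts to keeping the bin frequencies similar across subsets.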
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame containing molecular entities. | required |
| `sim_df` | `DataFrame` | Optional DataFrame containing pairwise similarity scores between entities, defaults to None. | `None` |
| `field_name` | `str` | Optional column name used for clustering. | `None` |
| `label_name` | `str` | Optional column name for labels used for balancing, defaults to None. | `None` |
| `test_size` | `float` | Proportion of entities to allocate to the test subset, defaults to 0.2. | `0.2` |
| `valid_size` | `float` | Proportion of entities to allocate to the validation subset, defaults to 0.0. | `0.0` |
| `threshold` | `float` | Similarity threshold used for clustering, defaults to 0.3. | `0.3` |
| `verbose` | `int` | Verbosity level. Higher values print detailed partition proportions, defaults to 0. | `0` |
| `branching_factor` | `int` | Branching factor for BitBirch clustering, defaults to 50. | `50` |
| `n_bins` | `int` | Number of bins used when discretizing labels for balancing, defaults to 10. | `10` |
| `radius` | `int` | Neighborhood radius for BitBirch clustering, defaults to 2. | `2` |
| `bits` | `int` | Number of bits used in fingerprint representation, defaults to 1024. | `1024` |
| `n_clusters` | `int` | Number of clusters to generate, defaults to 20. | `20` |
Returns:
| Type | Description |
|---|---|
| `Union[Tuple[List[int], List[int], List[int], np.ndarray], Tuple[List[int], List[int], np.ndarray]]` | If |
butina(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, n_bins=10, filter_smaller=True)
Partition a dataset using a Butina-style greedy clustering algorithm.
Generates clusters based on molecular similarity using a greedy
cover set approach (Butina clustering). Labels can be optionally
discretized for balancing, and the dataset is partitioned into
train/test/validation subsets using smallest_assignment. Prints
warnings if the partition sizes deviate from expectations.
Generalized to work on similarity matrices rather than fingerprints directly.
Reference: Butina D. Unsupervised data base clustering based on daylight's fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. Journal of Chemical Information and Computer Sciences. 1999 Jul 26;39(4):747-50.
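The greedy cover-set idea described above can be sketched on a plain similarity matrix as follows. This is an illustration under stated assumptions, not the library's implementation: the function name `butina_clusters` is hypothetical, pairs count as neighbours when their similarity is at least `threshold`, and candidate centroids are ranked once by neighbour count.

```python
from typing import Dict, List, Sequence


def butina_clusters(sim: Sequence[Sequence[float]], threshold: float = 0.3) -> List[int]:
    """Greedy Butina-style clustering; returns one cluster id per item."""
    n = len(sim)
    # Neighbours of i: every other item at least `threshold` similar to i.
    neighbours: Dict[int, List[int]] = {
        i: [j for j in range(n) if j != i and sim[i][j] >= threshold]
        for i in range(n)
    }
    # Visit candidate centroids from most to fewest neighbours (ties by index).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assignment = [-1] * n
    next_id = 0
    for centre in order:
        if assignment[centre] != -1:
            continue  # already absorbed by an earlier cluster
        assignment[centre] = next_id
        for j in neighbours[centre]:
            if assignment[j] == -1:
                assignment[j] = next_id
        next_id += 1
    return assignment


sim = [
    [1.0, 0.8, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1, 0.0],
    [0.7, 0.2, 1.0, 0.1, 0.0],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 1.0],
]
clusters = butina_clusters(sim, threshold=0.5)
```

Items 1 and 2 are not similar to each other, yet both join the cluster centred on item 0, which is the behaviour that distinguishes a centroid-based cover from pairwise-complete clustering.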
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame containing molecular entities. | required |
| `sim_df` | `DataFrame` | DataFrame or matrix containing pairwise similarity scores between entities. | required |
| `field_name` | `str` | Optional column name used for clustering. If None, clustering is based solely on | `None` |
| `label_name` | `str` | Optional column name for labels used for balancing, defaults to `None`. | `None` |
| `test_size` | `float` | Proportion of entities to allocate to the test subset, defaults to 0.2. | `0.2` |
| `valid_size` | `float` | Proportion of entities to allocate to the validation subset, defaults to 0.0. | `0.0` |
| `threshold` | `float` | Similarity threshold used for cluster formation, defaults to 0.3. | `0.3` |
| `verbose` | `int` | Verbosity level. Higher values print detailed partition proportions, defaults to 0. | `0` |
| `n_bins` | `int` | Number of bins used when discretizing labels for balancing, defaults to 10. | `10` |
| `filter_smaller` | `Optional[bool]` | Whether to filter smaller similarity values during clustering, defaults to True. | `True` |
Returns:
| Type | Description |
|---|---|
| `Union[Tuple[List[int], List[int], List[int], np.ndarray], Tuple[List[int], List[int], np.ndarray]]` | If |
ccpart(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, n_bins=10, filter_smaller=True)
Partitions a dataset into training, testing, and optional validation sets based on connected component clustering using a similarity matrix. Ensures clusters are kept intact across splits and optionally balances label distributions across partitions. Smallest clusters are iteratively assigned to testing.
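Connected-component clustering on a similarity matrix can be sketched with a small union-find. The helper name `connected_components` is hypothetical, and the handling of `filter_smaller` reflects one reading of the parameter (True for similarities, where values below the threshold are dropped; False for distances, where values above it are dropped). Whether the comparison is strict is also an assumption.

```python
from typing import Dict, List, Sequence


def connected_components(
    sim: Sequence[Sequence[float]],
    threshold: float = 0.3,
    filter_smaller: bool = True,
) -> List[int]:
    """Label items so that any two linked items share a component id."""
    n = len(sim)
    parent = list(range(n))

    def find(x: int) -> int:
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            value = sim[i][j]
            # Similarity: keep pairs at or above the threshold.
            # Distance: keep pairs at or below it.
            linked = value >= threshold if filter_smaller else value <= threshold
            if linked:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive ids in order of first appearance.
    ids: Dict[int, int] = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


sim = [
    [1.0, 0.6, 0.1, 0.0],
    [0.6, 1.0, 0.6, 0.0],
    [0.1, 0.6, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
components = connected_components(sim, threshold=0.5)
```

Items 0 and 2 end up together even though their direct similarity is only 0.1, because both are linked to item 1. Keeping such chains intact is what prevents near-duplicates from leaking across subsets.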
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame containing the dataset to be partitioned. | required |
| `sim_df` | `DataFrame` | DataFrame representing precomputed pairwise similarities between samples. | required |
| `field_name` | `str` | Name of the column in | `None` |
| `label_name` | `str` | Name of the label column for balancing partitions; if None, no balancing is performed. | `None` |
| `test_size` | `float` | Fraction of the dataset to allocate to the test set. | `0.2` |
| `valid_size` | `float` | Fraction of the dataset to allocate to the validation set; set to 0.0 to skip validation split. | `0.0` |
| `threshold` | `float` | Similarity threshold for connecting components when clustering. | `0.3` |
| `verbose` | `int` | Verbosity level for logging (higher values provide more detailed output). | `0` |
| `n_bins` | `int` | Number of bins to discretize continuous labels into for balancing purposes. | `10` |
| `filter_smaller` | `Optional[bool]` | Whether smaller values of the similarity metric indicate less similar entities. | `True` |
Returns:
| Type | Description |
|---|---|
| `Union[Tuple[list, list, list], Tuple[list, list, list, list]]` | |
ccpart_random(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, seed=0, n_bins=10, filter_smaller=True)
Partitions a dataset into training, testing, and optional validation sets based on connected component clustering using a similarity matrix. Ensures clusters are kept intact across splits and optionally balances label distributions across partitions. Clusters are assigned to testing randomly.
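Random assignment of whole clusters can be sketched as below. The helper `random_cluster_split` is hypothetical and ignores label balancing and the validation subset. It assumes the cluster ids have already been computed, and that a cluster is skipped when it would push the test subset past its budget.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def random_cluster_split(
    clusters: Sequence[int], test_size: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Send whole clusters to the test subset in a seeded random order."""
    members: Dict[int, List[int]] = defaultdict(list)
    for index, cluster_id in enumerate(clusters):
        members[cluster_id].append(index)

    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)  # reproducible for a given seed

    budget = test_size * len(clusters)
    train: List[int] = []
    test: List[int] = []
    for cluster_id in cluster_ids:
        # A cluster joins the test subset only if it fits in the remaining budget.
        if len(test) + len(members[cluster_id]) <= budget:
            test.extend(members[cluster_id])
        else:
            train.extend(members[cluster_id])
    return sorted(train), sorted(test)


clusters = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4]
train_idx, test_idx = random_cluster_split(clusters, test_size=0.3, seed=0)
```

Because clusters move as a unit, the achieved test fraction can fall short of `test_size`. Fixing the seed makes the split repeatable.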
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame containing the dataset to be partitioned. | required |
| `sim_df` | `DataFrame` | DataFrame representing precomputed pairwise similarities between samples. | required |
| `field_name` | `str` | Name of the column in | `None` |
| `label_name` | `str` | Name of the label column for balancing partitions; if None, no balancing is performed. | `None` |
| `test_size` | `float` | Fraction of the dataset to allocate to the test set. | `0.2` |
| `valid_size` | `float` | Fraction of the dataset to allocate to the validation set; set to 0.0 to skip validation split. | `0.0` |
| `threshold` | `float` | Similarity threshold for connecting components when clustering. | `0.3` |
| `verbose` | `int` | Verbosity level for logging (higher values provide more detailed output). | `0` |
| `n_bins` | `int` | Number of bins to discretize continuous labels into for balancing purposes. | `10` |
| `filter_smaller` | `Optional[bool]` | Whether smaller values of the similarity metric indicate less similar entities. | `True` |
Returns:
| Type | Description |
|---|---|
| `Union[Tuple[list, list, list], Tuple[list, list, list, list]]` | |
cdhit_part(df, sim_df, field_name=None, label_name=None, test_size=0.2, valid_size=0.0, threshold=0.3, verbose=0, n_bins=10, filter_smaller=True)
Partitions a dataset into training, testing, and optional validation sets based on connected component clustering using a similarity matrix. Ensures clusters are kept intact across splits and optionally balances label distributions across partitions. Smallest clusters are iteratively assigned to testing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame containing the dataset to be partitioned. | required |
| `sim_df` | `DataFrame` | DataFrame representing precomputed pairwise similarities between samples. | required |
| `field_name` | `str` | Name of the column in | `None` |
| `label_name` | `str` | Name of the label column for balancing partitions; if None, no balancing is performed. | `None` |
| `test_size` | `float` | Fraction of the dataset to allocate to the test set. | `0.2` |
| `valid_size` | `float` | Fraction of the dataset to allocate to the validation set; set to 0.0 to skip validation split. | `0.0` |
| `threshold` | `float` | Similarity threshold for connecting components when clustering. | `0.3` |
| `verbose` | `int` | Verbosity level for logging (higher values provide more detailed output). | `0` |
| `n_bins` | `int` | Number of bins to discretize continuous labels into for balancing purposes. | `10` |
| `filter_smaller` | `Optional[bool]` | Whether smaller values of the similarity metric indicate less similar entities. | `True` |
Returns:
| Type | Description |
|---|---|
| `Union[Tuple[list, list, list], Tuple[list, list, list, list]]` | |
graph_part(df, sim_df, label_name=None, test_size=0.0, valid_size=0.0, threshold=0.3, verbose=2, n_parts=10, filter_smaller=True)
Builds a graph from the provided similarity matrix, applies a limited agglomerative clustering algorithm, balances clusters across partitions, and performs iterative reassignment to minimize forbidden edges. The final output can optionally be split into train/test/validation subsets based on cluster proportions.
Reference: Teufel F, Gíslason MH, Almagro Armenteros JJ, Johansen AR, Winther O, Nielsen H. GraphPart: homology partitioning for biological sequence analysis. NAR genomics and bioinformatics. 2023 Dec 1;5(4):lqad088.
Code adapted and generalized from the project's GitHub repository: https://github.com/graph-part/graph-part
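To make the notion of a forbidden edge concrete, the sketch below lists every pair of entities that sits in different partitions despite being at least `threshold` similar. This is my reading of the term, and the helper `forbidden_edges` is hypothetical. The reassignment loop itself is not reproduced here.

```python
from typing import List, Sequence, Tuple


def forbidden_edges(
    sim: Sequence[Sequence[float]],
    partition: Sequence[int],
    threshold: float = 0.3,
) -> List[Tuple[int, int]]:
    """Return cross-partition pairs whose similarity reaches the threshold."""
    n = len(sim)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if partition[i] != partition[j] and sim[i][j] >= threshold
    ]


sim = [
    [1.0, 0.8, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1, 0.0],
    [0.7, 0.2, 1.0, 0.1, 0.0],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 1.0],
]
before = forbidden_edges(sim, partition=[0, 0, 1, 1, 1], threshold=0.5)
after = forbidden_edges(sim, partition=[0, 0, 0, 1, 1], threshold=0.5)
```

Moving item 2 into the partition of its close neighbour, item 0, removes the only violation. An iterative reassignment step repeats moves of this kind until no such pairs remain or no further improvement is possible.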
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame containing the entities to partition. | required |
| `sim_df` | `DataFrame` | Pairwise similarity DataFrame used to build the graph. | required |
| `label_name` | `str` | Optional column name containing entity labels used to guide cluster assignment for balancing, defaults to `None`. | `None` |
| `test_size` | `float` | Proportion of entities to allocate to the test split, defaults to `0.0`. | `0.0` |
| `valid_size` | `float` | Proportion of entities to allocate to the validation split (applied only after test split assignment), defaults to `0.0`. | `0.0` |
| `threshold` | `float` | Similarity threshold used to define edges in the graph and guide clustering, defaults to `0.3`. | `0.3` |
| `verbose` | `int` | Verbosity level. Values above 1 enable progress information, defaults to `2`. | `2` |
| `n_parts` | `int` | Number of partitions (clusters) to generate, defaults to `10`. | `10` |
| `filter_smaller` | `Optional[bool]` | If | `True` |
Returns:
| Type | Description |
|---|---|
| `Union[np.ndarray, Tuple[List[int], List[int], np.ndarray], Tuple[List[int], List[int], List[int], np.ndarray]]` | If |
random_partition(df, test_size, random_state=42, **kwargs)
Generates training and evaluation subsets by random partitioning.
Wrapper around the train_test_split function from scikit-learn.
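For intuition, a seeded random split can be written with the standard library alone. This is a stand-in for illustration, not the wrapper itself: the helper `random_index_split` is hypothetical and does not reproduce scikit-learn's exact sampling or rounding.

```python
import random
from typing import List, Tuple


def random_index_split(
    n_items: int, test_size: float, random_state: int = 42
) -> Tuple[List[int], List[int]]:
    """Shuffle positions with a fixed seed and cut off the test share."""
    indices = list(range(n_items))
    random.Random(random_state).shuffle(indices)
    n_test = round(test_size * n_items)
    # The first n_test shuffled positions form the test subset.
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = random_index_split(10, test_size=0.2)
```

Unlike the similarity-aware strategies on this page, a random split places no constraint on how similar training and evaluation entities may be.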
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame with the entities to partition. | required |
| `test_size` | `float` | Proportion of entities to be allocated to the test subset. | required |
| `random_state` | `int` | Seed for the pseudo-random number generator, defaults to 42. | `42` |
Returns:
| Type | Description |
|---|---|
| `Tuple[pd.DataFrame, pd.DataFrame]` | A tuple with the indexes of training and evaluation samples. |
scaffold(df, field_name, label_name=None, test_size=0.0, valid_size=0.0, n_bins=10, verbose=1)
Partition a dataset based on Bemis-Murcko scaffolds.
Generates Bemis-Murcko scaffolds from the molecular SMILES in field_name
and assigns clusters based on unique scaffolds. Optionally discretizes labels
for balancing and partitions the dataset into train/test/validation subsets
using smallest_assignment. Prints warnings if partition sizes deviate
significantly from expectations.
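The clustering step reduces to giving each distinct scaffold its own cluster id. The sketch below assumes the scaffold SMILES have already been computed (for example with RDKit) and are passed in as plain strings. The helper name and the example strings are illustrative only.

```python
from typing import Dict, List, Sequence


def clusters_from_scaffolds(scaffolds: Sequence[str]) -> List[int]:
    """Give every distinct scaffold string its own cluster id."""
    ids: Dict[str, int] = {}
    # Ids are handed out in order of first appearance.
    return [ids.setdefault(s, len(ids)) for s in scaffolds]


# Illustrative, precomputed scaffold SMILES (one per molecule).
scaffolds = ["c1ccccc1", "c1ccncc1", "c1ccccc1", "C1CCCCC1", "c1ccncc1"]
clusters = clusters_from_scaffolds(scaffolds)
```

Molecules that share a scaffold always land in the same subset, so the evaluation set contains only scaffolds absent from training.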
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame containing molecular data. | required |
| `field_name` | `str` | Column name containing SMILES strings used to generate scaffolds. | required |
| `label_name` | `str` | Optional column name containing labels used for balancing, defaults to `None`. | `None` |
| `test_size` | `float` | Proportion of entities to allocate to the test subset, defaults to `0.0`. | `0.0` |
| `valid_size` | `float` | Proportion of entities to allocate to the validation subset, defaults to `0.0`. | `0.0` |
| `n_bins` | `int` | Number of bins used when discretizing labels for balancing, defaults to `10`. | `10` |
| `verbose` | `int` | Verbosity level. When | `1` |
Returns:
| Type | Description |
|---|---|
| `Union[Tuple[List[int], List[int], List[int], np.ndarray], Tuple[List[int], List[int], np.ndarray]]` | If |
sim_umap(df, sim_df, field_name=None, label_name=None, test_size=0.0, valid_size=0.0, threshold=0.3, verbose=2, n_clusters=10, n_neighbors=15, n_components=2, n_pcs=50, min_dist=0.1, boolean_out=True, n_bins=10)
UMAP-based partitioning using an external similarity matrix.
Generalizes the UMAP_original algorithm (Guo et al., 2025) to work on
similarity matrices instead of binary fingerprints.
Generates clusters using UMAP while incorporating an external similarity
matrix (sim_df). Optionally discretizes labels for balancing and partitions
the dataset using smallest_assignment into train/test/validation subsets.
Prints warnings if achieved proportions deviate significantly from expectations.
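The proportion check mentioned above might look like the following sketch. The helper `check_proportions` and its `tolerance` of 0.05 are hypothetical, since the description does not say what counts as a significant deviation.

```python
import warnings
from typing import Dict


def check_proportions(
    n_train: int,
    n_test: int,
    n_valid: int,
    test_size: float,
    valid_size: float,
    tolerance: float = 0.05,  # assumed cut-off, not taken from the library
) -> Dict[str, float]:
    """Warn about subsets whose achieved share misses the requested one."""
    total = n_train + n_test + n_valid
    achieved = {"test": n_test / total, "valid": n_valid / total}
    expected = {"test": test_size, "valid": valid_size}
    deviations: Dict[str, float] = {}
    for name, target in expected.items():
        gap = abs(achieved[name] - target)
        if gap > tolerance:
            warnings.warn(
                f"{name} proportion is {achieved[name]:.2f}, expected {target:.2f}"
            )
            deviations[name] = gap
    return deviations


deviations = check_proportions(90, 10, 0, test_size=0.2, valid_size=0.0)
```

Deviations are expected with cluster-based splits, because whole clusters are assigned at once and rarely sum to the exact requested size.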
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame containing the entities to cluster and partition. | required |
| `sim_df` | `DataFrame` | Similarity matrix provided as a Polars DataFrame, used to augment UMAP clustering. | required |
| `field_name` | `str` | Optional name of a column with feature vectors used by UMAP. If | `None` |
| `label_name` | `str` | Optional column name containing labels used for balancing partitions, defaults to `None`. | `None` |
| `test_size` | `float` | Proportion of entities to allocate to the test subset, defaults to `0.0`. | `0.0` |
| `valid_size` | `float` | Proportion of entities to allocate to the validation subset, defaults to `0.0`. | `0.0` |
| `threshold` | `float` | Threshold used by the UMAP graph clustering step, defaults to `0.3`. | `0.3` |
| `verbose` | `int` | Verbosity level. When | `2` |
| `n_clusters` | `int` | Desired number of clusters to generate, defaults to `10`. | `10` |
| `n_neighbors` | `int` | UMAP | `15` |
| `n_components` | `int` | Number of UMAP embedding dimensions, defaults to `2`. | `2` |
| `n_pcs` | `int` | Number of principal components to compute before UMAP, defaults to `50`. | `50` |
| `min_dist` | `float` | UMAP | `0.1` |
| `boolean_out` | `bool` | Whether to convert the similarity thresholding output to boolean values, defaults to `True`. | `True` |
| `n_bins` | `int` | Number of bins used when discretizing labels for balancing, defaults to `10`. | `10` |
Returns:
| Type | Description |
|---|---|
| `Union[Tuple[List[int], List[int], List[int], np.ndarray], Tuple[List[int], List[int], np.ndarray]]` | If |
smallest_assignment(clusters, labels, size, valid_size, test_size)
Iteratively assigns the smallest subclusters to the test subset until it reaches the desired size.
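The strategy can be sketched in a few lines. The helper `smallest_first_split` is hypothetical. It handles only the test subset, ignores labels and the validation subset, and stops at the first cluster that would overshoot the target, which is one of several reasonable stopping rules.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple


def smallest_first_split(
    clusters: Sequence[int], test_size: float
) -> Tuple[List[int], List[int]]:
    """Fill the test subset with whole clusters, smallest first."""
    sizes = Counter(clusters)
    target = test_size * len(clusters)
    test_clusters: Set[int] = set()
    n_test = 0
    # Walk clusters from smallest to largest (ties broken by cluster id).
    for cluster_id, size in sorted(sizes.items(), key=lambda kv: (kv[1], kv[0])):
        if n_test + size > target:
            break  # the next cluster would overshoot the requested size
        test_clusters.add(cluster_id)
        n_test += size
    train = [i for i, c in enumerate(clusters) if c not in test_clusters]
    test = [i for i, c in enumerate(clusters) if c in test_clusters]
    return train, test


clusters = [0, 0, 0, 0, 0, 1, 1, 2, 3, 3]
train_idx, test_idx = smallest_first_split(clusters, test_size=0.3)
```

Starting from the smallest clusters sends the most isolated entities to evaluation and keeps the large, densely populated clusters for training.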
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `list_ids` | `list[str]` | Ordered list of item identifiers to assign. | required |
| `partition_lengths` | `ndarray` | Desired number of items for each partition. | required |
| `max_length_per_partition` | `(int, optional)` | Maximum allowed size for any partition, values in | required |
Returns:
| Type | Description |
|---|---|
| `Tuple[np.ndarray, np.ndarray, np.ndarray]` | The indices for the training, testing and validation subsets, in that order. |
umap_original(df, field_name, label_name=None, test_size=0.0, valid_size=0.0, threshold=0.3, verbose=2, n_clusters=10, n_neighbors=15, n_components=2, n_pcs=50, min_dist=0.1, radius=2, bits=1024, n_bins=10, **kwargs)
Computes UMAP embeddings using the specified feature column, generates cluster
assignments, discretizes labels (if provided), and then distributes the
instances into train/test/validation partitions using the
smallest_assignment strategy. Optional warnings are printed if resulting
partitions deviate significantly from expected proportions.
Reference: Guo Q, Hernandez-Hernandez S, Ballester PJ. UMAP-based clustering split for rigorous evaluation of AI models for virtual screening on cancer cell lines. Journal of Cheminformatics. 2025 Jun 10;17(1):94.
Code adapted from Pat Walters's useful_rdkit_utils GitHub repository: https://github.com/PatWalters/useful_rdkit_utils
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame containing entities to cluster and partition. | required |
| `field_name` | `str` | Name of the column containing the features used by UMAP. | required |
| `label_name` | `str` | Optional column name with labels used for balancing partitions, defaults to `None`. | `None` |
| `test_size` | `float` | Proportion of entities to place in the test subset, defaults to `0.0`. | `0.0` |
| `valid_size` | `float` | Proportion of entities to place in the validation subset, defaults to `0.0`. | `0.0` |
| `threshold` | `float` | Threshold applied during UMAP-based graph clustering, defaults to `0.3`. | `0.3` |
| `verbose` | `int` | Verbosity level. Values | `2` |
| `n_clusters` | `int` | Desired number of clusters to generate using UMAP, defaults to `10`. | `10` |
| `n_neighbors` | `int` | UMAP | `15` |
| `n_components` | `int` | Number of UMAP embedding dimensions, defaults to `2`. | `2` |
| `n_pcs` | `int` | Number of principal components to compute before UMAP, defaults to `50`. | `50` |
| `min_dist` | `float` | UMAP | `0.1` |
| `radius` | `int` | Radius value used by the UMAP graph construction, defaults to `2`. | `2` |
| `bits` | `int` | Dimensionality of any hashing step used for vector representations, defaults to `1024`. | `1024` |
| `n_bins` | `int` | Number of bins used when discretizing labels for balancing, defaults to `10`. | `10` |
| `kwargs` | `dict` | Additional keyword arguments passed to underlying UMAP or clustering routines. | `{}` |
Returns:
| Type | Description |
|---|---|
| `Union[Tuple[List[int], List[int], List[int], np.ndarray], Tuple[List[int], List[int], np.ndarray]]` | If |