Similarity calculation

`embedding_similarity(query_embds, target_embds=None, sim_function='cosine', threads=cpu_count(), threshold=0.0, verbose=3, save_alignment=False, filename=None, **kwargs)`

Calculates pairwise similarity between embeddings in query_embds and target_embds using specified similarity functions. Supports parallel processing to handle large datasets efficiently.

Parameters:

Name	Type	Description	Default
`query_embds`	`ndarray`	Array of embeddings for the query set. Each row should represent a single embedding.	required
`target_embds`	`Optional[ndarray]`	Array of embeddings for the target set. If None, self-comparison of `query_embds` is performed.	`None`
`sim_function`	`Union[str, Callable]`	Similarity function to use for pairwise comparison. Default is 'cosine'. Can be either a string specifying a built-in function or a custom callable.	`'cosine'`
`threads`	`int`	Number of CPU threads for parallel processing. Defaults to the system CPU count.	`cpu_count()`
`threshold`	`float`	Minimum similarity score required to include a pair in the results. Defaults to 0.0.	`0.0`
`verbose`	`int`	Verbosity level, where higher values increase output detail. Defaults to 3.	`3`
`save_alignment`	`bool`	If True, saves the alignment results to a compressed CSV file.	`False`
`filename`	`str`	Name for the output file if `save_alignment` is True. Defaults to a timestamp if None.	`None`
`**kwargs`		Additional keyword arguments for compatibility.	`{}`

Returns:

Type	Description
`pd.DataFrame`	DataFrame with columns `query`, `target`, and `metric`, where each row represents a pairwise similarity score above the specified `threshold`.

Raises:

Type	Description
`RuntimeError`	If any exception occurs in a thread during similarity calculation.
`KeyError`	If the specified `sim_function` is not supported.
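
The pairwise comparison described above can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: the helper names `cosine` and `pairwise_similarity` are hypothetical, plain lists stand in for `ndarray` inputs, a list of dictionaries stands in for the returned DataFrame, and the strict `>` comparison against `threshold` is an assumption based on the wording "above the specified `threshold`".

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarity(query, target=None, sim_function=cosine, threshold=0.0):
    """Return one record per (query, target) pair scoring above `threshold`."""
    # Mirrors the documented behaviour: no target means self-comparison.
    target = query if target is None else target
    rows = []
    for (i, q), (j, t) in product(enumerate(query), enumerate(target)):
        score = sim_function(q, t)
        if score > threshold:  # assumption: strict comparison
            rows.append({"query": i, "target": j, "metric": score})
    return rows


query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rows = pairwise_similarity(query, threshold=0.5)
for row in rows:
    print(row)
```

Because `sim_function` is an ordinary callable here, swapping in another metric only requires passing a different two-argument function, which is the same idea as the `Union[str, Callable]` type documented for `sim_function`.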

`molecular_similarity(df_query, df_target=None, field_name='smiles', sim_function='jaccard', fingerprint='mapc', bits=1024, radius=2, threshold=0.0, threads=cpu_count(), verbose=3, save_alignment=False, filename=None, **kwargs)`

Calculates pairwise molecular similarity between query and target molecules using specified fingerprint and similarity functions. Uses RDKit for molecular fingerprinting and similarity calculations.

Parameters:

Name	Type	Description	Default
`df_query`	`DataFrame`	DataFrame containing SMILES strings of query molecules. Each row should have a column specified by `field_name` with SMILES strings.	required
`df_target`	`DataFrame`	DataFrame containing SMILES strings of target molecules. If None, self-comparison of `df_query` is performed.	`None`
`field_name`	`str`	Column name in `df_query` and `df_target` that contains SMILES strings. Defaults to 'smiles'.	`'smiles'`
`sim_function`	`str`	Similarity function to use for pairwise comparison. Options include 'tanimoto', 'dice', 'sokal', 'rogot-goldberg', 'jaccard', 'canberra', and 'cosine'. Defaults to 'jaccard'.	`'jaccard'`
`fingerprint`	`str`	Type of fingerprint to use. Options are 'ecfp' (Extended-Connectivity Fingerprint), 'maccs' (MACCS keys), 'mapc' (requires the mapchiral package), or 'lipinski'. Defaults to 'mapc'.	`'mapc'`
`bits`	`int`	Size of the fingerprint bit vector, applicable to `ecfp` and `mapc`. Defaults to 1024.	`1024`
`radius`	`int`	Radius of the atom environments encoded in the fingerprint, applicable to `ecfp` and `mapc`. Defaults to 2.	`2`
`threshold`	`float`	Minimum similarity score required to include a pair in the results. Defaults to 0.0.	`0.0`
`threads`	`int`	Number of CPU threads for parallel processing. Defaults to the system CPU count.	`cpu_count()`
`verbose`	`int`	Verbosity level, where higher values increase output detail. Defaults to 3.	`3`
`save_alignment`	`bool`	If True, saves the alignment results to a compressed CSV file.	`False`
`filename`	`str`	Name for the output file if `save_alignment` is True. Defaults to a timestamp if None.	`None`
`**kwargs`		Additional keyword arguments for compatibility.	`{}`

Returns:

Type	Description
`pd.DataFrame`	DataFrame with columns `query`, `target`, and `metric`, where each row represents a pairwise similarity score above the specified `threshold`.

Raises:

Type	Description
`ImportError`	If RDKit (or mapchiral, if used with 'mapc') is not installed.
`ValueError`	If `field_name` is missing from `df_query` or `df_target`.
`NotImplementedError`	If `sim_function` is not supported by the function.
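
To make the fingerprint comparison concrete, here is a small sketch of two of the listed similarity functions applied to binary fingerprints. It assumes each fingerprint is represented as a Python set holding the indices of its "on" bits; the real function derives fingerprints from SMILES strings via RDKit, which is not reproduced here. The helper names `jaccard` and `dice` are illustrative only.

```python
def jaccard(fp_a, fp_b):
    """Shared on-bits divided by the on-bits present in either fingerprint."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def dice(fp_a, fp_b):
    """Twice the shared on-bits divided by the total on-bits of both."""
    total = len(fp_a) + len(fp_b)
    return 2 * len(fp_a & fp_b) / total if total else 0.0


# Toy fingerprints: indices of the bits that are set.
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 42, 77, 100}

print(jaccard(fp_a, fp_b))  # 3 shared bits / 6 bits in the union
print(dice(fp_a, fp_b))     # 2 * 3 shared bits / 9 bits in total
```

For binary fingerprints the Tanimoto coefficient is the same quantity as the Jaccard index, so 'tanimoto' and 'jaccard' would give identical scores on inputs like these.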

`protein_structure_similarity(df_query, df_target=None, field_name='structure', prefilter=True, denominator='shortest', representation='3di+aa', threshold=0.0, threads=cpu_count(), verbose=0, save_alignment=False, filename=None, **kwargs)`

Calculates pairwise structural similarity between query and target protein structures using Foldseek. Supports alignment based on several representations: 3Di alignment, TM alignment, and combined 3Di and amino acid alignment.

Parameters:

Name	Type	Description	Default
`df_query`	`DataFrame`	DataFrame containing query protein structures. Each row should have a column specified by `field_name` with paths to PDB files.	required
`df_target`	`DataFrame`	DataFrame containing target protein structures, with each row holding paths to PDB files in `field_name`. If None, self-comparison of `df_query` is performed.	`None`
`field_name`	`str`	Column name in `df_query` and `df_target` with paths to PDB structure files. Defaults to 'structure'.	`'structure'`
`prefilter`	`bool`	Enables prefiltering to reduce computation. Defaults to True.	`True`
`denominator`	`str`	Determines similarity normalization, using "shortest" (default), "longest", or the number of aligned residues (`n_aligned`).	`'shortest'`
`representation`	`str`	Alignment representation mode, with options '3di', 'TM', or '3di+aa'. Defaults to '3di+aa'.	`'3di+aa'`
`threshold`	`float`	Minimum similarity metric required to include an alignment in the results. Defaults to 0.0.	`0.0`
`threads`	`int`	Number of CPU threads for parallel processing. Defaults to system CPU count.	`cpu_count()`
`verbose`	`int`	Verbosity level for process logging, where higher values increase output detail.	`0`
`save_alignment`	`bool`	If True, saves alignment results to a compressed CSV file.	`False`
`filename`	`str`	Name for the output file if `save_alignment` is True. Defaults to a timestamp if None.	`None`
`**kwargs`		Additional keyword arguments for compatibility.	`{}`

Returns:

Type	Description
`Union[pd.DataFrame, np.ndarray]`	DataFrame with columns `query`, `target`, and `metric`, where each row represents an alignment with a similarity metric above `threshold`. Returns the metric value determined by `representation`.

Raises:

Type	Description
`ImportError`	If Foldseek is not installed or accessible in the system PATH.
`ValueError`	If `field_name` is missing from `df_query` or `df_target`.

`sequence_similarity_mmseqs(df_query, df_target=None, field_name='sequence', prefilter=True, denominator='shortest', threads=cpu_count(), is_nucleotide=False, threshold=0.0, verbose=0, save_alignment=False, filename=None)`

Calculates pairwise sequence similarity between query and target sequences using MMSeqs2, with optional prefiltering for efficiency. Designed for parallel execution and customizable alignment parameters.

Parameters:

Name	Type	Description	Default
`df_query`	`DataFrame`	DataFrame containing the query sequences. Each row should have a column specified by `field_name` with sequence strings.	required
`df_target`	`DataFrame`	DataFrame with target sequences, where each row has a `field_name` column containing sequence strings. If None, `df_query` will be used for self-comparisons.	`None`
`field_name`	`str`	Column name in `df_query` and `df_target` holding the sequence data to be aligned. Defaults to 'sequence'.	`'sequence'`
`prefilter`	`bool`	If True, performs an initial filtering step to reduce the number of comparisons.	`True`
`denominator`	`str`	Determines how similarity is calculated, using either "shortest" (default), "longest", or the number of aligned residues (`n_aligned`).	`'shortest'`
`threads`	`int`	Number of threads for parallel processing. Defaults to system CPU count.	`cpu_count()`
`is_nucleotide`	`bool`	Set to True if sequences are nucleotide-based. Defaults to False (for protein sequences).	`False`
`threshold`	`float`	Minimum similarity metric for alignment entries to be included in the output. Defaults to 0.0.	`0.0`
`verbose`	`int`	Verbosity level, where 0 is silent and higher levels increase detail in logging.	`0`
`save_alignment`	`bool`	If True, saves the resulting DataFrame to a compressed CSV file.	`False`
`filename`	`str`	Filename for saving the alignment results if `save_alignment` is True. If None, a timestamp is used as the filename.	`None`

Returns:

Type	Description
`pd.DataFrame`	DataFrame with columns `query`, `target`, and `metric`, where each row represents an alignment result with similarity metric above the `threshold`.

Raises:

Type	Description
`RuntimeError`	If MMSeqs2 is not installed or is unavailable in the system PATH.
`ValueError`	If `field_name` is not found in `df_query` or `df_target`.
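
The effect of the `denominator` option can be sketched as follows. This is an illustration under a stated assumption: that the reported metric is the number of identical aligned residues divided by the chosen denominator. The source only names the three options, so the exact numerator used by the library is not confirmed, and `normalized_identity` is a hypothetical helper.

```python
def normalized_identity(n_identical, len_query, len_target, n_aligned,
                        denominator="shortest"):
    """Divide the identical-residue count by the selected length."""
    denominators = {
        "shortest": min(len_query, len_target),
        "longest": max(len_query, len_target),
        "n_aligned": n_aligned,
    }
    if denominator not in denominators:
        raise ValueError(f"Unsupported denominator: {denominator!r}")
    return n_identical / denominators[denominator]


# One alignment: 40 identical residues over 45 aligned positions,
# between sequences of length 50 and 100.
for option in ("shortest", "longest", "n_aligned"):
    print(option, normalized_identity(40, 50, 100, 45, denominator=option))
```

With a fixed `threshold`, "longest" is the strictest of the three choices for sequences of very different lengths, since a short sequence fully contained in a long one still scores low.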

`sequence_similarity_needle(df_query, df_target=None, field_name='sequence', denominator='shortest', is_nucleotide=False, config=None, threshold=0.0, threads=cpu_count(), verbose=0, save_alignment=False, filename=None)`

Calculates pairwise sequence similarity between query and target sequences using the EMBOSS needleall tool. This function is designed for efficient parallel processing and supports custom alignment parameters.

Parameters:

Name	Type	Description	Default
`df_query`	`DataFrame`	DataFrame containing the query sequences. Each row should have a column specified by `field_name` that contains sequence strings.	required
`df_target`	`DataFrame`	DataFrame containing the target sequences, with a column specified by `field_name` containing sequence strings. If None, `df_query` will be used as the target DataFrame, performing self-comparisons.	`None`
`field_name`	`str`	Name of the column in `df_query` and `df_target` containing the sequence data to be compared. Defaults to 'sequence'.	`'sequence'`
`denominator`	`str`	Determines how similarity is calculated; options are "shortest" (default), "longest", or "average" sequence length between pairs.	`'shortest'`
`is_nucleotide`	`bool`	Indicates if the sequences are nucleotide sequences. If False, assumes sequences are protein-based.	`False`
`config`	`dict`	Dictionary of EMBOSS `needleall` alignment parameters, such as `gapopen`, `gapextend`, etc. If None, default configuration is used.	`None`
`threshold`	`float`	Minimum similarity metric required for alignment entries to be included in the output. Defaults to 0.0.	`0.0`
`threads`	`int`	Number of threads to use for parallel processing. Defaults to system CPU count.	`cpu_count()`
`verbose`	`int`	Verbosity level of function output; 0 is silent, higher numbers increase output detail.	`0`
`save_alignment`	`bool`	If True, saves the resulting DataFrame to a compressed CSV file.	`False`
`filename`	`str`	Filename for saving the alignment results if `save_alignment` is True. If None, a timestamp will be used as the filename.	`None`

Returns:

Type	Description
`pd.DataFrame`	DataFrame with columns `query`, `target`, and `metric`, where each row represents a sequence alignment result, filtered by the specified `threshold`.

Raises:

Type	Description
`ImportError`	Raised if the `needleall` tool from EMBOSS is not installed.
`ValueError`	Raised if `field_name` is missing from `df_query` or `df_target`.
`RuntimeError`	Raised if any alignment job encounters an exception during processing.
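
As an illustration of how a `config` dictionary might map onto a `needleall` invocation, the sketch below merges user-supplied values over a set of defaults and renders them as command-line qualifiers. The helper `build_needleall_args`, the file names, and the default values shown are assumptions made for the example; the library's actual defaults and the way it assembles the command are not documented here. The qualifiers `-asequence`, `-bsequence`, `-outfile`, `-gapopen`, and `-gapextend` are standard EMBOSS options.

```python
def build_needleall_args(query_fasta, target_fasta, out_file, config=None):
    """Assemble an argument list for `needleall` (hypothetical helper)."""
    # Illustrative defaults, overridden by anything passed in `config`.
    defaults = {"gapopen": 10.0, "gapextend": 0.5}
    params = {**defaults, **(config or {})}

    args = [
        "needleall",
        "-asequence", query_fasta,
        "-bsequence", target_fasta,
        "-outfile", out_file,
    ]
    for key, value in params.items():
        args += [f"-{key}", str(value)]
    return args


args = build_needleall_args(
    "query.fasta", "target.fasta", "alignments.txt", config={"gapopen": 12}
)
print(" ".join(args))
```

The resulting list is in the form expected by `subprocess.run`, which avoids shell quoting issues when file paths contain spaces.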

`sequence_similarity_peptides(df_query, df_target=None, field_name='sequence', denominator='shortest', threads=cpu_count(), threshold=0.0, verbose=0, save_alignment=False, filename=None)`

Calculates pairwise sequence similarity between query and target peptide sequences using MMSeqs2. Sequences are divided into "small," "medium," and "normal" categories based on length, and each category is aligned with a specific method for optimal recall.

- `_small_alignment`: For sequences with 8 or fewer residues, checks whether one sequence is a subsequence of the other.
- `_medium_alignment`: For sequences of 9 to 20 residues, uses a lower threshold with MMSeqs2 to filter alignments.
- `_normal_alignment`: For sequences longer than 20 residues, performs full alignments with MMSeqs2.
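
The length-based routing above can be sketched in a few lines. The cut-offs (8 and 20 residues) come from the description; everything else is an assumption for illustration: the helper names are hypothetical, "subsequence" is interpreted as a contiguous substring, and a match is scored as 1.0. How the library handles pairs whose sequences fall into different categories is not specified, so the sketch does not cover it.

```python
def length_category(sequence):
    """Assign a peptide to the 'small', 'medium', or 'normal' bucket."""
    if len(sequence) <= 8:
        return "small"
    if len(sequence) <= 20:
        return "medium"
    return "normal"


def small_alignment(seq_a, seq_b):
    """Return 1.0 if the shorter peptide occurs inside the longer one."""
    shorter, longer = sorted((seq_a, seq_b), key=len)
    # Assumption: containment means a contiguous match.
    return 1.0 if shorter in longer else 0.0


peptides = ["ACDEFGHK", "ACDEFGHKL", "ACDEFGHKLMNPQRSTVWYAC"]
for peptide in peptides:
    print(len(peptide), length_category(peptide))

print(small_alignment("DEF", "ACDEFG"))
print(small_alignment("DFE", "ACDEFG"))
```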

Parameters:

Name	Type	Description	Default
`df_query`	`DataFrame`	DataFrame containing the query peptide sequences. Each row should have a column specified by `field_name` with peptide sequence strings.	required
`df_target`	`Optional[DataFrame]`	DataFrame with target peptide sequences, where each row has a `field_name` column containing sequence strings. If None, `df_query` will be used for self-comparisons.	`None`
`field_name`	`Optional[str]`	Column name in `df_query` and `df_target` holding the sequence data to be aligned. Defaults to 'sequence'.	`'sequence'`
`denominator`	`Optional[str]`	Determines how similarity is calculated, using either "shortest" (default), "longest", or the number of aligned residues (`n_aligned`).	`'shortest'`
`threads`	`Optional[int]`	Number of threads for parallel processing. Defaults to system CPU count.	`cpu_count()`
`threshold`	`Optional[float]`	Minimum similarity metric for alignment entries to be included in the output. Defaults to 0.0.	`0.0`
`verbose`	`Optional[int]`	Verbosity level, where 0 is silent and higher levels increase detail in logging.	`0`
`save_alignment`	`Optional[bool]`	If True, saves the resulting DataFrame to a compressed CSV file.	`False`
`filename`	`Optional[str]`	Filename for saving the alignment results if `save_alignment` is True. If None, a timestamp is used as the filename.	`None`

Returns:

Type	Description
`pd.DataFrame`	DataFrame with columns `query`, `target`, and `metric`, where each row represents an alignment result with similarity metric above the `threshold`.

Raises:

Type	Description
`RuntimeError`	If MMSeqs2 is not installed or is unavailable in the system PATH.
`ValueError`	If `field_name` is not found in `df_query` or `df_target`.

`sim_df2mtx(sim_df, size_query=None, size_target=None, threshold=0.0, filter_smaller=True, boolean_out=True)`

Converts a DataFrame of similarity scores into a sparse matrix representation, optionally filtering based on a similarity threshold and producing a boolean or numerical output.

Parameters:

Name	Type	Description	Default
`sim_df`	`DataFrame`	DataFrame containing similarity data with `query`, `target`, and `metric` columns.	required
`size_query`	`Optional[int]`	Total number of unique query indices, defining the first dimension of the matrix. Defaults to the number of unique queries in `sim_df`.	`None`
`size_target`	`Optional[int]`	Total number of unique target indices, defining the second dimension of the matrix. Defaults to `size_query`, assuming a square matrix.	`None`
`threshold`	`Optional[float]`	Similarity score threshold for filtering. Defaults to 0.0.	`0.0`
`filter_smaller`	`Optional[bool]`	If True, retains values above the threshold. If False, retains values below it.	`True`
`boolean_out`	`Optional[bool]`	If True, converts output to boolean values, representing presence/absence of similarity. If False, retains original similarity values.	`True`

Returns:

Type	Description
`spr.csr_matrix`	Symmetric sparse matrix of filtered similarity scores, either in boolean or numerical format.
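
The conversion can be pictured with a dependency-free sketch in which a dictionary keyed by `(query, target)` stands in for the sparse matrix. This is only an approximation of the documented behaviour: the real function returns a `spr.csr_matrix`, the helper name `sim_rows_to_sparse` is hypothetical, the input is a list of records rather than a DataFrame, the strict comparisons are an assumption, and no symmetrization is attempted.

```python
def sim_rows_to_sparse(rows, size_query=None, size_target=None,
                       threshold=0.0, filter_smaller=True, boolean_out=True):
    """Build a dictionary-of-keys view of the filtered similarity scores."""
    if size_query is None:
        size_query = len({row["query"] for row in rows})
    if size_target is None:
        size_target = size_query  # square matrix by default

    entries = {}
    for row in rows:
        score = row["metric"]
        # filter_smaller=True keeps scores above the threshold,
        # filter_smaller=False keeps scores below it.
        keep = score > threshold if filter_smaller else score < threshold
        if keep:
            entries[(row["query"], row["target"])] = True if boolean_out else score
    return entries, (size_query, size_target)


rows = [
    {"query": 0, "target": 1, "metric": 0.9},
    {"query": 1, "target": 0, "metric": 0.2},
    {"query": 1, "target": 1, "metric": 0.6},
]
entries, shape = sim_rows_to_sparse(rows, threshold=0.5)
print(shape, entries)
```

Only the retained pairs are stored, which is the property that makes a sparse format worthwhile when most pairs fall outside the threshold.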
