Similarity calculation
embedding_similarity(query_embds, target_embds=None, sim_function='cosine', threads=cpu_count(), threshold=0.0, verbose=3, save_alignment=False, filename=None, **kwargs)
Calculates pairwise similarity between embeddings in query_embds and target_embds using the specified similarity function. Supports parallel processing to handle large datasets efficiently.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query_embds | ndarray | Array of embeddings for the query set. Each row should represent a single embedding. | required |
target_embds | Optional[ndarray] | Array of embeddings for the target set. If None, self-comparison of | None |
sim_function | Union[str, Callable] | Similarity function to use for pairwise comparison. Default is 'cosine'. Can be either a string specifying a built-in function or a custom callable. | 'cosine' |
threads | int | Number of CPU threads for parallel processing. Defaults to the system CPU count. | cpu_count() |
threshold | float | Minimum similarity score required to include a pair in the results. Defaults to 0.0. | 0.0 |
save_alignment | bool | If True, saves the alignment results to a compressed CSV file. | False |
filename | str | Name for the output file if | None |
**kwargs | | Additional keyword arguments for compatibility. | {} |
Returns:
Type | Description |
---|---|
pl.DataFrame | DataFrame with columns |
Raises:
Type | Description |
---|---|
RuntimeError | If any exception occurs in a thread during similarity calculation. |
KeyError | If the specified |
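The thresholded pairwise comparison described for embedding_similarity can be illustrated with a minimal pure-Python sketch. It does not call the library: `cosine` and `pairwise_cosine` are hypothetical helper names, the `(query, target, score)` tuple layout is an assumption, and treating the threshold as inclusive is also an assumption.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_cosine(
    query: Sequence[Sequence[float]],
    target: Optional[Sequence[Sequence[float]]] = None,
    threshold: float = 0.0,
) -> List[Tuple[int, int, float]]:
    """Return (query_index, target_index, score) for every pair scoring
    at least `threshold` (inclusive comparison is an assumption)."""
    if target is None:
        # Mirrors the documented self-comparison when no target set is given.
        target = query
    pairs = []
    for i, q in enumerate(query):
        for j, t in enumerate(target):
            score = cosine(q, t)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


query = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.0, 0.0], [1.0, 1.0]]
pairs = pairwise_cosine(query, target, threshold=0.5)
```

In this toy input the orthogonal pair (query 1, target 0) scores 0.0 and is dropped by the 0.5 threshold, while the other three pairs are kept.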
molecular_similarity(df_query, df_target=None, field_name='smiles', sim_function='jaccard', fingerprint='mapc', bits=1024, radius=2, threshold=0.0, threads=cpu_count(), verbose=3, save_alignment=False, filename=None, **kwargs)
Calculates pairwise molecular similarity between query and target molecules using specified fingerprint and similarity functions. Uses RDKit for molecular fingerprinting and similarity calculations.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_query | DataFrame | DataFrame containing SMILES strings of query molecules. Each row should have a column specified by | required |
df_target | DataFrame | DataFrame containing SMILES strings of target molecules. If None, self-comparison of | None |
field_name | str | Column name in | 'smiles' |
sim_function | str | Similarity function to use for pairwise comparison. Options include 'tanimoto', 'dice', 'sokal', 'rogot-goldberg', 'jaccard', 'canberra', and 'cosine'. Defaults to 'jaccard'. | 'jaccard' |
fingerprint | str | Type of fingerprint to use. Options are 'ecfp' (Extended-Connectivity Fingerprint), 'maccs' (MACCS keys), 'mapc' (requires the mapchiral package), or 'lipinski'. Defaults to 'mapc'. | 'mapc' |
bits | int | Size of the fingerprint bit vector, applicable to | 1024 |
radius | int | Radius for the ECFP fingerprint, applicable to | 2 |
threshold | float | Minimum similarity score required to include a pair in the results. Defaults to 0.0. | 0.0 |
threads | int | Number of CPU threads for parallel processing. Defaults to the system CPU count. | cpu_count() |
verbose | int | Verbosity level, where higher values increase output detail. Defaults to 3. | 3 |
save_alignment | bool | If True, saves the alignment results to a compressed CSV file. | False |
filename | str | Name for the output file if | None |
**kwargs | | Additional keyword arguments for compatibility. | {} |
Returns:
Type | Description |
---|---|
pl.DataFrame | DataFrame with columns |
Raises:
Type | Description |
---|---|
ImportError | If RDKit (or mapchiral, if used with 'mapc') is not installed. |
ValueError | If |
NotImplementedError | If |
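To make the 'jaccard' and 'dice' options of molecular_similarity concrete, here is a small sketch that scores two fingerprints represented as sets of on-bit indices. It is an illustration only and does not use RDKit or the library's fingerprinting: the helper names and the toy bit sets are hypothetical, and returning 0.0 for two empty fingerprints is an assumption. For binary fingerprints the Jaccard index is the same quantity as the Tanimoto coefficient.

```python
from typing import Set


def jaccard(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Jaccard index: shared on-bits divided by on-bits set in either fingerprint."""
    union = fp_a | fp_b
    if not union:
        return 0.0  # assumption: two empty fingerprints score 0.0
    return len(fp_a & fp_b) / len(union)


def dice(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Dice coefficient: twice the shared on-bits over the summed on-bit counts."""
    total = len(fp_a) + len(fp_b)
    if total == 0:
        return 0.0  # assumption: two empty fingerprints score 0.0
    return 2 * len(fp_a & fp_b) / total


# Toy fingerprints: indices of the bits that are switched on.
fp_1 = {1, 5, 9, 12}
fp_2 = {1, 5, 7}

jaccard_score = jaccard(fp_1, fp_2)  # 2 shared bits / 5 distinct bits
dice_score = dice(fp_1, fp_2)        # 2 * 2 shared bits / (4 + 3) bits
```

Dice weights the shared bits twice, so for the same pair it is never lower than Jaccard.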
protein_structure_similarity(df_query, df_target=None, field_name='structure', prefilter=True, denominator='shortest', representation='3di+aa', threshold=0.0, threads=cpu_count(), verbose=0, save_alignment=False, filename=None, **kwargs)
Calculates pairwise structural similarity between query and target protein structures using Foldseek. Supports alignment based on various representations, including 3D alignment, TM alignment, and combined 3D and amino acid alignments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_query | DataFrame | DataFrame containing query protein structures. Each row should have a column specified by | required |
df_target | DataFrame | DataFrame containing target protein structures, with each row holding paths to PDB files in | None |
field_name | str | Column name in | 'structure' |
prefilter | bool | Enables prefiltering to reduce computation. Defaults to True. | True |
denominator | str | Determines similarity normalization, using "shortest" (default), "longest", or the number of aligned residues ( | 'shortest' |
representation | str | Alignment representation mode, with options '3di', 'TM', or '3di+aa'. Defaults to '3di+aa'. | '3di+aa' |
threshold | float | Minimum similarity metric required to include an alignment in the results. Defaults to 0.0. | 0.0 |
threads | int | Number of CPU threads for parallel processing. Defaults to system CPU count. | cpu_count() |
verbose | int | Verbosity level for process logging, where higher values increase output detail. | 0 |
save_alignment | bool | If True, saves alignment results to a compressed CSV file. | False |
filename | str | Name for the output file if | None |
**kwargs | | Additional keyword arguments for compatibility. | {} |
|
Returns:
Type | Description |
---|---|
Union[pd.DataFrame, np.ndarray] | DataFrame with columns |
Raises:
Type | Description |
---|---|
ImportError | If Foldseek is not installed or accessible in the system PATH. |
ValueError | If |
sequence_similarity_mmseqs(df_query, df_target=None, field_name='sequence', prefilter=True, denominator='shortest', threads=cpu_count(), is_nucleotide=False, threshold=0.0, verbose=0, save_alignment=False, filename=None)
Calculate pairwise sequence similarity between query and target sequences using MMSeqs2, with optional prefiltering for efficiency. Designed for parallel execution and customizable alignment parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_query | DataFrame | DataFrame containing the query sequences. Each row should have a column specified by | required |
df_target | DataFrame | DataFrame with target sequences, where each row has a | None |
field_name | str | Column name in | 'sequence' |
prefilter | bool | If True, performs an initial filtering step to reduce the number of comparisons. | True |
denominator | str | Determines how similarity is calculated, using either "shortest" (default), "longest", or the number of aligned residues ( | 'shortest' |
threads | int | Number of threads for parallel processing. Defaults to system CPU count. | cpu_count() |
is_nucleotide | bool | Set to True if sequences are nucleotide-based. Defaults to False (for protein sequences). | False |
threshold | float | Minimum similarity metric for alignment entries to be included in the output. Defaults to 0.0. | 0.0 |
verbose | int | Verbosity level, where 0 is silent and higher levels increase detail in logging. | 0 |
save_alignment | bool | If True, saves the resulting DataFrame to a compressed CSV file. | False |
filename | str | Filename for saving the alignment results if | None |
|
Returns:
Type | Description |
---|---|
pd.DataFrame | DataFrame with columns |
Raises:
Type | Description |
---|---|
RuntimeError | If MMSeqs2 is not installed or is unavailable in the system PATH. |
ValueError | If |
sequence_similarity_needle(df_query, df_target=None, field_name='sequence', denominator='shortest', is_nucleotide=False, config=None, threshold=0.0, threads=cpu_count(), verbose=0, save_alignment=False, filename=None)
Calculate pairwise sequence similarity between query and target sequences using the EMBOSS needleall tool. This function is designed for efficient parallel processing and supports custom alignment parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_query | DataFrame | DataFrame containing the query sequences. Each row should have a column specified by | required |
df_target | DataFrame | DataFrame containing the target sequences, with a column specified by | None |
field_name | str | Name of the column in | 'sequence' |
denominator | str | Determines how similarity is calculated; options are "shortest" (default), "longest", or "average" sequence length between pairs. | 'shortest' |
is_nucleotide | bool | Indicates if the sequences are nucleotide sequences. If False, assumes sequences are protein-based. | False |
config | dict | Dictionary of EMBOSS | None |
threshold | float | Minimum similarity metric required for alignment entries to be included in the output. Defaults to 0.0. | 0.0 |
threads | int | Number of threads to use for parallel processing. Defaults to system CPU count. | cpu_count() |
verbose | int | Verbosity level of function output; 0 is silent, higher numbers increase output detail. | 0 |
save_alignment | bool | If True, saves the resulting DataFrame to a compressed CSV file. | False |
filename | str | Filename for saving the alignment results if | None |
|
Returns:
Type | Description |
---|---|
pd.DataFrame | DataFrame with columns |
Raises:
Type | Description |
---|---|
ImportError | Raised if the |
ValueError | Raised if |
RuntimeError | Raised if any alignment job encounters an exception during processing. |
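The effect of the denominator option on sequence_similarity_needle can be sketched as follows. This is a hedged illustration rather than the library's code: `normalised_identity` is a hypothetical helper, using the count of identical aligned positions as the numerator is an assumption, and raising ValueError for an unknown option is a choice made for this sketch.

```python
def normalised_identity(
    n_identical: int,
    len_query: int,
    len_target: int,
    denominator: str = "shortest",
) -> float:
    """Divide the number of identical aligned positions by the chosen length."""
    if denominator == "shortest":
        length = min(len_query, len_target)
    elif denominator == "longest":
        length = max(len_query, len_target)
    elif denominator == "average":
        length = (len_query + len_target) / 2
    else:
        # Sketch-level choice: reject anything other than the three documented options.
        raise ValueError(f"Unknown denominator: {denominator!r}")
    return n_identical / length


# A 10-residue query aligned to a 20-residue target with 8 identical positions.
by_shortest = normalised_identity(8, 10, 20, "shortest")  # 8 / 10
by_longest = normalised_identity(8, 10, 20, "longest")    # 8 / 20
by_average = normalised_identity(8, 10, 20, "average")    # 8 / 15
```

The same alignment therefore clears or fails a given threshold depending on the denominator, with "shortest" being the most permissive of the three.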
sequence_similarity_peptides(df_query, df_target=None, field_name='sequence', denominator='shortest', threads=cpu_count(), threshold=0.0, verbose=0, save_alignment=False, filename=None)
Calculates pairwise sequence similarity between query and target peptide sequences using MMSeqs2. Sequences are divided into "small," "medium," and "normal" categories based on length, and each category is aligned with a specific method for optimal recall.
- _small_alignment: For sequences with 8 or fewer residues, checks if one sequence is a subsequence of the other.
- _medium_alignment: For sequences between 9 and 20 residues, uses a lower threshold with MMSeqs2 to filter alignments.
- _normal_alignment: For sequences longer than 20 residues, performs full alignments with MMSeqs2.
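The length categories listed above can be sketched in a few lines. The function names below are hypothetical stand-ins, not the library's private helpers, and reading "subsequence" as a contiguous substring is an assumption.

```python
def length_category(sequence: str) -> str:
    """Assign a peptide to the category described above, based on its length."""
    n_residues = len(sequence)
    if n_residues <= 8:
        return "small"
    if n_residues <= 20:
        return "medium"
    return "normal"


def small_alignment(query: str, target: str) -> bool:
    """Check whether the shorter peptide occurs inside the longer one.

    Assumption: "subsequence" is interpreted as a contiguous substring.
    """
    shorter, longer = sorted((query, target), key=len)
    return shorter in longer


categories = [length_category(s) for s in ("ACDEFGHK", "ACDEFGHKL", "A" * 21)]
contained = small_alignment("DEF", "ACDEFGHK")
not_contained = small_alignment("DFE", "ACDEFGHK")
```

Here the 8-residue peptide falls in the small category, the 9-residue peptide in medium, and the 21-residue peptide in normal.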
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_query | DataFrame | DataFrame containing the query peptide sequences. Each row should have a column specified by | required |
df_target | Optional[DataFrame] | DataFrame with target peptide sequences, where each row has a | None |
field_name | Optional[str] | Column name in | 'sequence' |
denominator | Optional[str] | Determines how similarity is calculated, using either "shortest" (default), "longest", or the number of aligned residues ( | 'shortest' |
threads | Optional[int] | Number of threads for parallel processing. Defaults to system CPU count. | cpu_count() |
threshold | Optional[float] | Minimum similarity metric for alignment entries to be included in the output. Defaults to 0.0. | 0.0 |
verbose | Optional[int] | Verbosity level, where 0 is silent and higher levels increase detail in logging. | 0 |
save_alignment | Optional[bool] | If True, saves the resulting DataFrame to a compressed CSV file. | False |
filename | Optional[str] | Filename for saving the alignment results if | None |
|
Returns:
Type | Description |
---|---|
pd.DataFrame | DataFrame with columns |
Raises:
Type | Description |
---|---|
RuntimeError | If MMSeqs2 is not installed or is unavailable in the system PATH. |
ValueError | If |
sim_df2mtx(sim_df, size_query=None, size_target=None, threshold=0.0, filter_smaller=True, boolean_out=True)
Converts a DataFrame of similarity scores into a sparse matrix representation, optionally filtering based on a similarity threshold and producing a boolean or numerical output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sim_df | Union[DataFrame, DataFrame] | DataFrame containing similarity data with | required |
size_query | Optional[int] | Total number of unique query indices, defining the first dimension of the matrix. Defaults to the number of unique queries in | None |
size_target | Optional[int] | Total number of unique target indices, defining the second dimension of the matrix. Defaults to | None |
threshold | Optional[float] | Similarity score threshold for filtering. Defaults to 0.0. | 0.0 |
filter_smaller | Optional[bool] | If True, retains values above the threshold. If False, retains values below it. | True |
boolean_out | Optional[bool] | If True, converts output to boolean values, representing presence/absence of similarity. If False, retains original similarity values. | True |
|
Returns:
Type | Description |
---|---|
spr.csr_matrix | Symmetric sparse matrix of filtered similarity scores, either in boolean or numerical format. |
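The interaction of threshold, filter_smaller and boolean_out can be illustrated with a dictionary-of-keys stand-in for the sparse matrix. This is a simplified sketch, not the library's implementation: it returns a plain dict instead of a `spr.csr_matrix`, it does not symmetrise the result, the `(query, target, score)` tuple layout is an assumption, and so is the use of strict comparisons against the threshold.

```python
from typing import Dict, Iterable, Tuple, Union

Entry = Tuple[int, int, float]


def pairs_to_sparse(
    pairs: Iterable[Entry],
    threshold: float = 0.0,
    filter_smaller: bool = True,
    boolean_out: bool = True,
) -> Dict[Tuple[int, int], Union[bool, float]]:
    """Keep only entries on the requested side of the threshold.

    Entries that are not kept are simply absent, as in a sparse matrix.
    """
    matrix: Dict[Tuple[int, int], Union[bool, float]] = {}
    for query, target, score in pairs:
        # Strict comparison on both sides is an assumption of this sketch.
        keep = score > threshold if filter_smaller else score < threshold
        if keep:
            matrix[(query, target)] = True if boolean_out else score
    return matrix


pairs = [(0, 1, 0.9), (0, 2, 0.2), (1, 2, 0.5)]
above_bool = pairs_to_sparse(pairs, threshold=0.4)
above_vals = pairs_to_sparse(pairs, threshold=0.4, boolean_out=False)
below_bool = pairs_to_sparse(pairs, threshold=0.4, filter_smaller=False)
```

With a threshold of 0.4, the default settings keep the pairs scoring 0.9 and 0.5 as boolean entries, while `filter_smaller=False` keeps only the pair scoring 0.2.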