Similarity calculation

embedding_similarity(query_embds, target_embds=None, sim_function='cosine', threads=cpu_count(), threshold=0.0, verbose=3, save_alignment=False, filename=None, **kwargs)

Calculates pairwise similarity between embeddings in query_embds and target_embds using specified similarity functions. Supports parallel processing to handle large datasets efficiently.

Parameters:

Name Type Description Default
query_embds ndarray

Array of embeddings for the query set. Each row should represent a single embedding.

required
target_embds Optional[ndarray]

Array of embeddings for the target set. If None, self-comparison of query_embds is performed.

None
sim_function Union[str, Callable]

Similarity function to use for pairwise comparison. Default is 'cosine'. Can be either a string specifying a built-in function or a custom callable.

'cosine'
threads int

Number of CPU threads for parallel processing. Defaults to the system CPU count.

cpu_count()
threshold float

Minimum similarity score required to include a pair in the results. Defaults to 0.0.

0.0
save_alignment bool

If True, saves the alignment results to a compressed CSV file.

False
filename str

Name for the output file if save_alignment is True. Defaults to a timestamp if None.

None
**kwargs

Additional keyword arguments for compatibility.

{}

Returns:

Type Description
pl.DataFrame

DataFrame with columns query, target, and metric, where each row represents a pairwise similarity score above the specified threshold.

Raises:

Type Description
RuntimeError

If any exception occurs in a thread during similarity calculation.

KeyError

If the specified sim_function is not supported.
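
As a rough illustration of the output described above, the following minimal sketch computes cosine similarity for every query/target pair in plain Python and keeps only the pairs above the threshold. It is not the library's implementation: the helper names `cosine` and `pairwise_similarity` are hypothetical, rows are plain dictionaries rather than a DataFrame, and treating the threshold as a strict lower bound is an assumption.

```python
import math
from typing import Dict, List, Optional, Sequence, Union

Row = Dict[str, Union[int, float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity(
    query: Sequence[Sequence[float]],
    target: Optional[Sequence[Sequence[float]]] = None,
    threshold: float = 0.0,
) -> List[Row]:
    """Hypothetical helper: one row per pair whose score exceeds the threshold."""
    if target is None:
        target = query  # self-comparison, mirroring target_embds=None
    rows: List[Row] = []
    for i, q in enumerate(query):
        for j, t in enumerate(target):
            score = cosine(q, t)
            if score > threshold:  # assumption: strictly above the threshold
                rows.append({"query": i, "target": j, "metric": score})
    return rows


# Three toy 2-D embeddings, compared against themselves.
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rows = pairwise_similarity(query, threshold=0.5)
```

In this toy case the two orthogonal embeddings (indices 0 and 1) score 0.0 against each other, so that pair is dropped in both directions.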

molecular_similarity(df_query, df_target=None, field_name='smiles', sim_function='jaccard', fingerprint='mapc', bits=1024, radius=2, threshold=0.0, threads=cpu_count(), verbose=3, save_alignment=False, filename=None, **kwargs)

Calculates pairwise molecular similarity between query and target molecules using specified fingerprint and similarity functions. Uses RDKit for molecular fingerprinting and similarity calculations.

Parameters:

Name Type Description Default
df_query DataFrame

DataFrame containing SMILES strings of query molecules. Each row should have a column specified by field_name with SMILES strings.

required
df_target DataFrame

DataFrame containing SMILES strings of target molecules. If None, self-comparison of df_query is performed.

None
field_name str

Column name in df_query and df_target that contains SMILES strings. Defaults to 'smiles'.

'smiles'
sim_function str

Similarity function to use for pairwise comparison. Options include 'tanimoto', 'dice', 'sokal', 'rogot-goldberg', 'jaccard', 'canberra', and 'cosine'. Defaults to 'jaccard'.

'jaccard'
fingerprint str

Type of fingerprint to use. Options are 'ecfp' (Extended-Connectivity Fingerprint), 'maccs' (MACCS keys), 'mapc' (requires the mapchiral package), or 'lipinski'. Defaults to 'mapc'.

'mapc'
bits int

Size of the fingerprint bit vector, applicable to ecfp and mapc. Defaults to 1024.

1024
radius int

Fingerprint radius, applicable to ecfp and mapc. Defaults to 2.

2
threshold float

Minimum similarity score required to include a pair in the results. Defaults to 0.0.

0.0
threads int

Number of CPU threads for parallel processing. Defaults to the system CPU count.

cpu_count()
verbose int

Verbosity level, where higher values increase output detail. Defaults to 3.

3
save_alignment bool

If True, saves the alignment results to a compressed CSV file.

False
filename str

Name for the output file if save_alignment is True. Defaults to a timestamp if None.

None
**kwargs

Additional keyword arguments for compatibility.

{}

Returns:

Type Description
pl.DataFrame

DataFrame with columns query, target, and metric, where each row represents a pairwise similarity score above the specified threshold.

Raises:

Type Description
ImportError

If RDKit (or mapchiral, if used with 'mapc') is not installed.

ValueError

If field_name is missing from df_query or df_target.

NotImplementedError

If sim_function is not supported by the function.
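
To make the fingerprint-based metrics concrete, here is a small sketch of the Jaccard and Dice coefficients on binary fingerprints, with each fingerprint represented as the set of its "on" bit positions. For binary fingerprints, Tanimoto and Jaccard give the same value. The function names and the toy bit sets are made up for illustration; the real function derives fingerprints from SMILES with RDKit, which is not shown here.

```python
from typing import Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by on-bits present in either fingerprint."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen for this sketch when both are empty
    return len(a & b) / union


def dice(a: Set[int], b: Set[int]) -> float:
    """Twice the shared on-bits divided by the total on-bits of both fingerprints."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 2 * len(a & b) / total


# Toy fingerprints: positions of the bits that are set (hypothetical values).
fp_a = {1, 5, 9, 12}
fp_b = {5, 9, 20}

jaccard_score = jaccard(fp_a, fp_b)  # 2 shared bits / 5 bits in the union
dice_score = dice(fp_a, fp_b)        # 2 * 2 shared bits / 7 bits in total
```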

protein_structure_similarity(df_query, df_target=None, field_name='structure', prefilter=True, denominator='shortest', representation='3di+aa', threshold=0.0, threads=cpu_count(), verbose=0, save_alignment=False, filename=None, **kwargs)

Calculates pairwise structural similarity between query and target protein structures using Foldseek. Supports alignment based on several representations: the 3Di structural alphabet ('3di'), TM alignment ('TM'), and combined 3Di and amino-acid alignment ('3di+aa').

Parameters:

Name Type Description Default
df_query DataFrame

DataFrame containing query protein structures. Each row should have a column specified by field_name with paths to PDB files.

required
df_target DataFrame

DataFrame containing target protein structures, with each row holding paths to PDB files in field_name. If None, self-comparison of df_query is performed.

None
field_name str

Column name in df_query and df_target with paths to PDB structure files. Defaults to 'structure'.

'structure'
prefilter bool

Enables prefiltering to reduce computation. Defaults to True.

True
denominator str

Determines how similarity is normalized: by the length of the "shortest" sequence (default), by the "longest" sequence, or by the number of aligned residues ("n_aligned").

'shortest'
representation str

Alignment representation mode, with options '3di', 'TM', or '3di+aa'. Defaults to '3di+aa'.

'3di+aa'
threshold float

Minimum similarity metric required to include an alignment in the results. Defaults to 0.0.

0.0
threads int

Number of CPU threads for parallel processing. Defaults to system CPU count.

cpu_count()
verbose int

Verbosity level for process logging, where higher values increase output detail.

0
save_alignment bool

If True, saves alignment results to a compressed CSV file.

False
filename str

Name for the output file if save_alignment is True. Defaults to a timestamp if None.

None
**kwargs

Additional keyword arguments for compatibility.

{}

Returns:

Type Description
Union[pd.DataFrame, np.ndarray]

DataFrame with columns query, target, and metric, where each row represents an alignment with a similarity metric above threshold. Returns the metric value determined by representation.

Raises:

Type Description
ImportError

If Foldseek is not installed or accessible in the system PATH.

ValueError

If field_name is missing from df_query or df_target.

sequence_similarity_mmseqs(df_query, df_target=None, field_name='sequence', prefilter=True, denominator='shortest', threads=cpu_count(), is_nucleotide=False, threshold=0.0, verbose=0, save_alignment=False, filename=None)

Calculates pairwise sequence similarity between query and target sequences using MMSeqs2, with optional prefiltering for efficiency. Designed for parallel execution and customizable alignment parameters.

Parameters:

Name Type Description Default
df_query DataFrame

DataFrame containing the query sequences. Each row should have a column specified by field_name with sequence strings.

required
df_target DataFrame

DataFrame with target sequences, where each row has a field_name column containing sequence strings. If None, df_query will be used for self-comparisons.

None
field_name str

Column name in df_query and df_target holding the sequence data to be aligned. Defaults to 'sequence'.

'sequence'
prefilter bool

If True, performs an initial filtering step to reduce the number of comparisons.

True
denominator str

Determines how similarity is normalized: by the length of the "shortest" sequence (default), by the "longest" sequence, or by the number of aligned residues ("n_aligned").

'shortest'
threads int

Number of threads for parallel processing. Defaults to system CPU count.

cpu_count()
is_nucleotide bool

Set to True if sequences are nucleotide-based. Defaults to False (for protein sequences).

False
threshold float

Minimum similarity metric for alignment entries to be included in the output. Defaults to 0.0.

0.0
verbose int

Verbosity level, where 0 is silent and higher levels increase detail in logging.

0
save_alignment bool

If True, saves the resulting DataFrame to a compressed CSV file.

False
filename str

Filename for saving the alignment results if save_alignment is True. If None, a timestamp is used as the filename.

None

Returns:

Type Description
pd.DataFrame

DataFrame with columns query, target, and metric, where each row represents an alignment result with similarity metric above the threshold.

Raises:

Type Description
RuntimeError

If MMSeqs2 is not installed or is unavailable in the system PATH.

ValueError

If field_name is not found in df_query or df_target.
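
The effect of the denominator option can be sketched as follows. The source does not state the exact formula, so this assumes the metric is the number of identical aligned residues divided by the chosen length; the function name `normalized_identity` and all of its arguments are hypothetical.

```python
def normalized_identity(
    n_identical: int,
    len_query: int,
    len_target: int,
    n_aligned: int,
    denominator: str = "shortest",
) -> float:
    """Identity count divided by the length selected through `denominator`."""
    if denominator == "shortest":
        denom = min(len_query, len_target)
    elif denominator == "longest":
        denom = max(len_query, len_target)
    elif denominator == "n_aligned":
        denom = n_aligned
    else:
        raise ValueError(f"Unsupported denominator: {denominator!r}")
    return n_identical / denom


# Toy alignment: 40 identical residues over 50 aligned positions,
# between sequences of length 60 and 100.
by_shortest = normalized_identity(40, 60, 100, 50, "shortest")  # 40 / 60
by_longest = normalized_identity(40, 60, 100, 50, "longest")    # 40 / 100
by_aligned = normalized_identity(40, 60, 100, 50, "n_aligned")  # 40 / 50
```

Under this reading, "longest" gives the most conservative score and "n_aligned" the most permissive one for the same alignment, which matters when choosing a threshold.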

sequence_similarity_needle(df_query, df_target=None, field_name='sequence', denominator='shortest', is_nucleotide=False, config=None, threshold=0.0, threads=cpu_count(), verbose=0, save_alignment=False, filename=None)

Calculates pairwise sequence similarity between query and target sequences using the EMBOSS needleall tool. This function is designed for efficient parallel processing and supports custom alignment parameters.

Parameters:

Name Type Description Default
df_query DataFrame

DataFrame containing the query sequences. Each row should have a column specified by field_name that contains sequence strings.

required
df_target DataFrame

DataFrame containing the target sequences, with a column specified by field_name containing sequence strings. If None, df_query will be used as the target DataFrame, performing self-comparisons.

None
field_name str

Name of the column in df_query and df_target containing the sequence data to be compared. Defaults to 'sequence'.

'sequence'
denominator str

Determines how similarity is calculated; options are "shortest" (default), "longest", or "average" sequence length between pairs.

'shortest'
is_nucleotide bool

Indicates if the sequences are nucleotide sequences. If False, assumes sequences are protein-based.

False
config dict

Dictionary of EMBOSS needleall alignment parameters, such as gapopen, gapextend, etc. If None, default configuration is used.

None
threshold float

Minimum similarity metric required for alignment entries to be included in the output. Defaults to 0.0.

0.0
threads int

Number of threads to use for parallel processing. Defaults to system CPU count.

cpu_count()
verbose int

Verbosity level of function output; 0 is silent, higher numbers increase output detail.

0
save_alignment bool

If True, saves the resulting DataFrame to a compressed CSV file.

False
filename str

Filename for saving the alignment results if save_alignment is True. If None, a timestamp will be used as the filename.

None

Returns:

Type Description
pd.DataFrame

DataFrame with columns query, target, and metric, where each row represents a sequence alignment result, filtered by the specified threshold.

Raises:

Type Description
ImportError

Raised if the needleall tool from EMBOSS is not installed.

ValueError

Raised if field_name is missing from df_query or df_target.

RuntimeError

Raised if any alignment job encounters an exception during processing.

sequence_similarity_peptides(df_query, df_target=None, field_name='sequence', denominator='shortest', threads=cpu_count(), threshold=0.0, verbose=0, save_alignment=False, filename=None)

Calculates pairwise sequence similarity between query and target peptide sequences using MMSeqs2. Sequences are divided into "small," "medium," and "normal" categories based on length, and each category is aligned with a specific method for optimal recall.

  • _small_alignment: For sequences with 8 or fewer residues, checks if one sequence is a subsequence of the other.
  • _medium_alignment: For sequences between 9 and 20 residues, uses a lower threshold with MMSeqs2 to filter alignments.
  • _normal_alignment: For sequences longer than 20 residues, performs full alignments with MMSeqs2.
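
The length categories above can be sketched as a simple routing function. The cut-offs (8 and 20 residues) come from the list; the function names are hypothetical, and reading "subsequence" as a contiguous substring is an assumption. How the library assigns a pair of sequences with different lengths to a category is not stated here, so the sketch only classifies single sequences.

```python
def length_category(sequence: str) -> str:
    """Assign a peptide to 'small', 'medium', or 'normal' by its length."""
    n = len(sequence)
    if n <= 8:
        return "small"
    if n <= 20:
        return "medium"
    return "normal"


def small_match(a: str, b: str) -> bool:
    """True if the shorter peptide occurs contiguously inside the longer one.

    Assumption: 'subsequence' is interpreted as a contiguous substring.
    """
    shorter, longer = sorted((a, b), key=len)
    return shorter in longer


categories = {
    seq: length_category(seq)
    for seq in ("ACDEFGHK", "ACDEFGHKL", "A" * 21)
}
contained = small_match("DEFG", "ACDEFGHK")
not_contained = small_match("DGF", "ACDEFGHK")
```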

Parameters:

Name Type Description Default
df_query DataFrame

DataFrame containing the query peptide sequences. Each row should have a column specified by field_name with peptide sequence strings.

required
df_target Optional[DataFrame]

DataFrame with target peptide sequences, where each row has a field_name column containing sequence strings. If None, df_query will be used for self-comparisons.

None
field_name Optional[str]

Column name in df_query and df_target holding the sequence data to be aligned. Defaults to 'sequence'.

'sequence'
denominator Optional[str]

Determines how similarity is normalized: by the length of the "shortest" sequence (default), by the "longest" sequence, or by the number of aligned residues ("n_aligned").

'shortest'
threads Optional[int]

Number of threads for parallel processing. Defaults to system CPU count.

cpu_count()
threshold Optional[float]

Minimum similarity metric for alignment entries to be included in the output. Defaults to 0.0.

0.0
verbose Optional[int]

Verbosity level, where 0 is silent and higher levels increase detail in logging.

0
save_alignment Optional[bool]

If True, saves the resulting DataFrame to a compressed CSV file.

False
filename Optional[str]

Filename for saving the alignment results if save_alignment is True. If None, a timestamp is used as the filename.

None

Returns:

Type Description
pd.DataFrame

DataFrame with columns query, target, and metric, where each row represents an alignment result with similarity metric above the threshold.

Raises:

Type Description
RuntimeError

If MMSeqs2 is not installed or is unavailable in the system PATH.

ValueError

If field_name is not found in df_query or df_target.

sim_df2mtx(sim_df, size_query=None, size_target=None, threshold=0.0, filter_smaller=True, boolean_out=True)

Converts a DataFrame of similarity scores into a sparse matrix representation, optionally filtering based on a similarity threshold and producing a boolean or numerical output.

Parameters:

Name Type Description Default
sim_df Union[pd.DataFrame, pl.DataFrame]

DataFrame containing similarity data with query, target, and metric columns.

required
size_query Optional[int]

Total number of unique query indices, defining the first dimension of the matrix. Defaults to the number of unique queries in sim_df.

None
size_target Optional[int]

Total number of unique target indices, defining the second dimension of the matrix. Defaults to size_query, assuming a square matrix.

None
threshold Optional[float]

Similarity score threshold for filtering. Defaults to 0.0.

0.0
filter_smaller Optional[bool]

If True, retains values above the threshold. If False, retains values below it.

True
boolean_out Optional[bool]

If True, converts output to boolean values, representing presence/absence of similarity. If False, retains original similarity values.

True

Returns:

Type Description
spr.csr_matrix

Symmetric sparse matrix of filtered similarity scores, either in boolean or numerical format.
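
The interaction between threshold, filter_smaller, and boolean_out can be illustrated with a plain-Python stand-in. This sketch stores the result as a dictionary keyed by (query, target) instead of a SciPy CSR matrix, does not mirror entries to make the result symmetric, and treats the threshold as a strict bound; all three are simplifications assumed for the example, and `rows_to_sparse` is a hypothetical name.

```python
from typing import Dict, List, Tuple, Union

Row = Dict[str, Union[int, float]]
Sparse = Dict[Tuple[int, int], Union[bool, float]]


def rows_to_sparse(
    rows: List[Row],
    threshold: float = 0.0,
    filter_smaller: bool = True,
    boolean_out: bool = True,
) -> Sparse:
    """Keep the (query, target) entries that pass the threshold filter."""
    matrix: Sparse = {}
    for row in rows:
        score = row["metric"]
        # filter_smaller=True keeps scores above the threshold, False keeps those below.
        keep = score > threshold if filter_smaller else score < threshold
        if keep:
            key = (int(row["query"]), int(row["target"]))
            matrix[key] = True if boolean_out else score
    return matrix


rows = [
    {"query": 0, "target": 1, "metric": 0.9},
    {"query": 0, "target": 2, "metric": 0.2},
    {"query": 1, "target": 2, "metric": 0.6},
]

similar = rows_to_sparse(rows, threshold=0.5)
dissimilar = rows_to_sparse(rows, threshold=0.5, filter_smaller=False, boolean_out=False)
```

Here `similar` holds boolean flags for the two pairs scoring above 0.5, while `dissimilar` keeps the original score of the single pair below it.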