AutoPeptideML

Main class for handling the automatic development of bioactive peptide ML predictors.

Initialize instance of the AutoPeptideML class

Parameters:

Name	Type	Description	Default
`verbose`	`bool`	Whether to output information, defaults to True	`True`
`threads`	`int`	Number of threads used to parallelise processes, defaults to cpu_count()	`cpu_count()`
`seed`	`int`	Pseudo-random number generator seed. Important for reproducibility, defaults to 42	`42`

`autosearch_negatives(df_pos, positive_tags, proportion=1.0, target_db=None)`

Search a database of bioactive peptides for negative peptides to pair with the positive ones.

Parameters:

Name	Type	Description	Default
`df_pos`	`DataFrame`	DataFrame with positive peptides.	required
`positive_tags`	`List[str]`	List of names of bioactivities that may overlap with the target bioactivities.	required
`proportion`	`float`	Negative:Positive ratio in the new dataset, defaults to 1.0.	`1.0`
`target_db`	`Optional[str]`	Path to CSV containing a database with columns `sequence` and bioactivities.	`None`

Returns:

Type	Description
`pd.DataFrame`	New dataset with both positive and negative peptides.
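
The idea behind the negative search can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the helper `sample_negatives`, the list-of-dicts data layout, and the `tags` field are assumptions made for illustration, and the real method may apply further criteria when choosing negatives.

```python
import random


def sample_negatives(positives, database, positive_tags, proportion=1.0, seed=42):
    """Toy negative search (hypothetical helper, not the AutoPeptideML code).

    Draws peptides from `database` that carry none of the excluded tags
    and are not already part of the positive set.
    """
    positive_seqs = {p["sequence"] for p in positives}
    excluded = set(positive_tags)
    # Candidate negatives: no overlapping bioactivity tag, not a known positive.
    candidates = [
        entry for entry in database
        if not (excluded & set(entry["tags"]))
        and entry["sequence"] not in positive_seqs
    ]
    # Honour the requested Negative:Positive ratio as far as candidates allow.
    n_wanted = min(len(candidates), round(proportion * len(positives)))
    rng = random.Random(seed)
    negatives = rng.sample(candidates, n_wanted)
    # Label positives with Y=1 and sampled negatives with Y=0.
    dataset = [{"sequence": p["sequence"], "Y": 1} for p in positives]
    dataset += [{"sequence": n["sequence"], "Y": 0} for n in negatives]
    return dataset


positives = [{"sequence": "KLAKLAK"}, {"sequence": "GIGKFLH"}]
database = [
    {"sequence": "AAAAGG", "tags": ["antiviral"]},
    {"sequence": "KLAKLAK", "tags": ["antimicrobial"]},
    {"sequence": "PPGPPG", "tags": ["structural"]},
    {"sequence": "DEDEDE", "tags": ["antimicrobial", "toxic"]},
    {"sequence": "SSTSST", "tags": ["signal"]},
]
dataset = sample_negatives(positives, database, positive_tags=["antimicrobial"])
```

With `proportion=1.0` the sketch returns as many negatives as positives, and no entry tagged with an overlapping bioactivity can be drawn as a negative.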

`balance_samples(df)`

Oversample the underrepresented class in the DataFrame.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame with positive and negative peptides to be balanced.	required

Returns:

Type	Description
`pd.DataFrame`	DataFrame with balanced number of positive and negative peptides.
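
Oversampling of the underrepresented class can be sketched as follows. This is a minimal illustration under assumptions: rows are plain dicts with a `Y` label, sampling is done with replacement, and the helper name `oversample_minority` is hypothetical. The library's own sampling strategy may differ.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key="Y", seed=42):
    """Duplicate random rows of the smaller class until all classes match."""
    counts = Counter(row[label_key] for row in rows)
    if len(counts) < 2:
        return list(rows)  # nothing to balance
    majority_size = max(counts.values())
    rng = random.Random(seed)
    balanced = list(rows)
    for label, count in counts.items():
        pool = [row for row in rows if row[label_key] == label]
        # Sampling with replacement; k is 0 for the majority class.
        balanced += rng.choices(pool, k=majority_size - count)
    return balanced


rows = [
    {"sequence": "KLAKLAK", "Y": 1},
    {"sequence": "AAAAGG", "Y": 0},
    {"sequence": "PPGPPG", "Y": 0},
    {"sequence": "SSTSST", "Y": 0},
]
balanced = oversample_minority(rows)
balanced_counts = Counter(row["Y"] for row in balanced)
```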

`compute_representations(datasets, re)`

Use a Protein Representation Model, loaded with the RepresentationEngine class, to compute representations for the peptides in the datasets.

Parameters:

Name	Type	Description	Default
`datasets`	`Dict[str, DataFrame]`	Dictionary with the dataset partitions as DataFrames. Output from the method `train_test_partition`.	required
`re`	`RepresentationEngine`	Instance wrapping a Protein Representation Model.	required

Returns:

Type	Description
`dict`	Dictionary with pd.DataFrame `id` column as keys and the representation of the `sequence` column as values.
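
The shape of the returned mapping can be shown with a toy example. The amino-acid composition vector below is only a hypothetical stand-in for a Protein Representation Model, and the partitions are plain lists of dicts rather than DataFrames; both are assumptions made to keep the sketch self-contained.

```python
CANONICAL = "ACDEFGHIKLMNPQRSTVWY"


def toy_representation(sequence):
    """Hypothetical stand-in for a representation model:
    amino-acid composition as a 20-dimensional vector."""
    return [sequence.count(aa) / len(sequence) for aa in CANONICAL]


def build_id2rep(datasets):
    """Map every `id` in every partition to the representation of its sequence."""
    id2rep = {}
    for partition in datasets.values():
        for row in partition:
            id2rep[row["id"]] = toy_representation(row["sequence"])
    return id2rep


datasets = {
    "train": [{"id": 0, "sequence": "ACAC"}, {"id": 1, "sequence": "KLAKLAK"}],
    "test": [{"id": 2, "sequence": "GIGKFLH"}],
}
id2rep = build_id2rep(datasets)
```

The resulting `id2rep` dictionary has the same role as the `id2rep` argument expected by `hpo_train` and `evaluate_model`.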

`curate_dataset(dataset, outputdir=None)`

Load a dataset from a path, or take an already loaded DataFrame, and remove all entries with non-canonical residues or duplicated sequences.

Parameters:

Name	Type	Description	Default
`dataset`	`Union[str, DataFrame]`	Dataset or path to dataset.	required
`outputdir`	`str`	Path where the curated dataset is saved, defaults to None	`None`

Returns:

Type	Description
`pd.DataFrame`	Curated dataset.
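
The two curation rules (canonical residues only, no repeated sequences) can be sketched on a plain list of strings. The helper `curate` is hypothetical, and treating only the 20 uppercase one-letter codes as canonical is an assumption of this sketch.

```python
CANONICAL_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def curate(sequences):
    """Keep the first occurrence of each sequence made only of canonical residues."""
    seen = set()
    curated = []
    for seq in sequences:
        # Drop empty entries and anything containing a non-canonical symbol.
        if not seq or not set(seq) <= CANONICAL_RESIDUES:
            continue
        # Drop repeated sequences.
        if seq in seen:
            continue
        seen.add(seq)
        curated.append(seq)
    return curated


raw = ["KLAKLAK", "KLAKLAK", "GXGKFLH", "ACDEFG", "klak"]
curated = curate(raw)
```

Here the duplicate `KLAKLAK`, the entry containing `X`, and the lowercase entry are removed.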

`evaluate_model(best_model, test_df, id2rep, outputdir)`

Evaluate an ensemble model.

Parameters:

Name	Type	Description	Default
`best_model`	`list`	List of models with a `predict_proba` method.	required
`test_df`	`DataFrame`	Evaluation dataset with `id`, `sequence` and `Y` columns.	required
`id2rep`	`dict`	Dictionary with keys being the `id` and the values the peptide representations.	required
`outputdir`	`str`	Path where the evaluation data is saved.	required

Returns:

Type	Description
`pd.DataFrame`	Dataset with the evaluation metrics.

`hpo_train(config, train_df, id2rep, folds, outputdir, n_jobs=1)`

Hyperparameter Optimisation and training.

Parameters:

Name	Type	Description	Default
`config`	`dict`	dictionary with hyperparameter search space.	required
`train_df`	`DataFrame`	Training dataset with `id` column and `Y` column with the bioactivity target.	required
`id2rep`	`dict`	Dictionary with pd.DataFrame `id` column as keys and the representation of the `sequence` column as values.	required
`folds`	`list`	List with the training/validation folds	required
`outputdir`	`str`	Path to the directory where information should be saved.	required
`n_jobs`	`int`	Number of threads to parallelise the training, defaults to 1.	`1`

Returns:

Type	Description
`list`	List with the models that comprise the final ensemble.

`predict(df, re, ensemble_path, outputdir, df_repr=None, backend='onnx')`

Predicts scores and uncertainties for input sequences using a pre-trained ensemble of models.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame containing sequences to be predicted. Must include a 'sequence' column.	required
`re`	`RepresentationEngine`	Representation engine for computing feature representations of sequences.	required
`ensemble_path`	`str`	Path to the directory containing the ensemble of trained models.	required
`outputdir`	`str`	Directory where prediction results will be saved. Created if it does not exist.	required
`df_repr`	`list`	Precomputed representations of the input sequences. If None, representations are computed.	`None`
`backend`	`str`	Backend used for prediction. Supported values are 'onnx' (default) and 'joblib'.	`'onnx'`

Returns:

Type	Description
`pd.DataFrame`	DataFrame with predictions, including 'score' (average prediction) and 'score_uncertainty' (standard deviation).

Raises:

Type	Description
`ImportError`	If required libraries for the selected backend are not installed.
`NotImplementedError`	If an unsupported backend is specified.

Notes:

- Converts joblib models to ONNX format if backend='onnx' and joblib models are provided.
- Saves predictions to 'predictions.csv' in the specified outputdir.

Example:

    >>> from mymodule import RepresentationEngine, Predictor
    >>> predictor = Predictor(verbose=True)
    >>> df = pd.DataFrame({'sequence': ['ATCG', 'GCTA']})
    >>> re = RepresentationEngine()
    >>> predictions = predictor.predict(
    ...     df, re, ensemble_path='./ensemble', outputdir='./output'
    ... )
    >>> print(predictions)
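
How the 'score' and 'score_uncertainty' columns relate to the ensemble can be sketched without any ML dependency. `ConstantModel` is a hypothetical stand-in for a trained model with a `predict_proba` method, and the use of the population standard deviation is an assumption: the documentation only says "standard deviation".

```python
from statistics import mean, pstdev


class ConstantModel:
    """Hypothetical model that always predicts the same positive-class probability."""

    def __init__(self, p):
        self.p = p

    def predict_proba(self, X):
        # One [P(negative), P(positive)] pair per input sample.
        return [[1 - self.p, self.p] for _ in X]


def ensemble_predict(models, X):
    """Average the positive-class probability over all models, per sample."""
    per_model = [[row[1] for row in m.predict_proba(X)] for m in models]
    results = []
    # zip(*per_model) regroups the scores sample by sample.
    for sample_scores in zip(*per_model):
        results.append({
            "score": mean(sample_scores),
            "score_uncertainty": pstdev(sample_scores),  # assumed: population std
        })
    return results


models = [ConstantModel(0.2), ConstantModel(0.4), ConstantModel(0.6)]
preds = ensemble_predict(models, X=[[0.0], [1.0]])
```

A large `score_uncertainty` indicates that the ensemble members disagree on a sequence, so its averaged score should be read with more caution.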

`train_test_partition(df, threshold=0.3, test_size=0.2, denominator='n_aligned', alignment=None, outputdir='./splits')`

Novel homology partitioning algorithm for generating independent hold-out evaluation sets.

This method partitions the provided dataset into training and testing sets based on sequence similarity. It ensures that sequences in the training and testing sets do not exceed a specified sequence identity threshold, resulting in distinct datasets for evaluation.

Example:

    data = pd.DataFrame({'id': [...], 'sequence': [...], 'Y': [...]})
    partitioned_data = train_test_partition(
        df=data, threshold=0.4, test_size=0.25, denominator='shortest',
        alignment='needle', outputdir='./data_splits'
    )

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Dataset to partition with the following columns: `id`, `sequence`, and `Y`.	required
`threshold`	`float`	Maximum sequence identity allowed between sequences in training and evaluation sets. Sequences exceeding this threshold in similarity will not appear in both sets. Defaults to 0.3.	`0.3`
`test_size`	`float`	Proportion of samples in evaluation (test) set. A float between 0 and 1, where 0.2 means 20% of the dataset will be allocated to the test set. Defaults to 0.2.	`0.2`
`denominator`	`str`	Denominator used to calculate sequence identity between pairs of sequences. Options: `'shortest'` (the shortest sequence length), `'longest'` (the longest sequence length), `'n_aligned'` (the length of the aligned region between sequences).	`'n_aligned'`
`alignment`	`str`	Sequence alignment method to compute similarity. Options: `'peptides'` (peptide sequence alignment), `'mmseqs'` (local Smith-Waterman alignment), `'mmseqs+prefilter'` (fast alignment using Smith-Waterman with k-mer prefiltering), `'needle'` (global Needleman-Wunsch alignment).	`None`
`outputdir`	`str`	Directory where the resulting train and test CSV files will be saved. Defaults to `'./splits'`.	`'./splits'`

Returns:

Type	Description
`Dict[str, pd.DataFrame]`	A dictionary containing the training and testing DataFrames, with keys `'train'` (the training set) and `'test'` (the testing set).

Raises:

Type	Description
`FileNotFoundError`	If the output directory cannot be created or accessed.
`ValueError`	If an unsupported alignment method is specified.
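
The effect of the `denominator` option on sequence identity can be made concrete with a small sketch. It takes an alignment that has already been computed (two gapped strings of equal length) and is not the library's code. In particular, reading `'n_aligned'` as the number of alignment columns is an assumption of this sketch.

```python
def sequence_identity(aligned_a, aligned_b, denominator="n_aligned"):
    """Identity between two pre-aligned sequences ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("Aligned sequences must have the same length.")
    # A column counts as a match when both residues are equal and not gaps.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    if denominator == "shortest":
        denom = min(len_a, len_b)
    elif denominator == "longest":
        denom = max(len_a, len_b)
    elif denominator == "n_aligned":
        denom = len(aligned_a)  # assumed: number of alignment columns
    else:
        raise ValueError(f"Unsupported denominator: {denominator}")
    return matches / denom


aligned_a = "KLAKLAKG-"
aligned_b = "KLA-LAK-W"
identities = {
    name: sequence_identity(aligned_a, aligned_b, name)
    for name in ("shortest", "longest", "n_aligned")
}
# With threshold=0.3, a pair this similar should not be split across train and test.
too_similar = identities["n_aligned"] > 0.3
```

The same pair yields three different identity values (6/7, 6/8 and 6/9 here), so the choice of denominator changes which pairs exceed `threshold`.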

`train_val_partition(df, method='random', threshold=0.5, alignment='peptides', denominator='n_aligned', n_folds=10, outputdir='./folds')`

Method for generating n training/validation folds for cross-validation.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Training dataset with `id`, `sequence`, and `Y` columns.	required
`method`	`str`	Method for generating the folds. Options available: `random` through `sklearn.model_selection.StratifiedKFold` or `graph-part` through `graphpart.stratified_k_fold`, defaults to `random`.	`'random'`
`threshold`	`float`	If `method` is `graph-part`, maximum sequence identity allowed between sequences in training and validation sets, defaults to 0.5	`0.5`
`denominator`	`str`	Denominator to calculate sequence identity. Options: `shortest` (shortest sequence length), `longest` (longest sequence length), `n_aligned` (length of the alignment).	`'n_aligned'`
`alignment`	`str`	If `method` is `graph-part`, alignment algorithm to use. Options available: `mmseqs` (local Smith-Waterman alignment), `mmseqs+prefilter` (fast local Smith-Waterman alignment with k-mer prefiltering), and `needle` (global Needleman-Wunsch alignment).	`'peptides'`
`n_folds`	`int`	Number of training/validation folds to generate, defaults to 10	`10`
`outputdir`	`str`	Path where data should be saved, defaults to './folds'	`'./folds'`

Returns:

Type	Description
`list`	List of training/validation folds
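
For the `random` method, the documentation points to `sklearn.model_selection.StratifiedKFold`. The underlying idea of stratified folds can be sketched in plain Python as below. This is a simplified illustration, not scikit-learn's algorithm, and the `{"train": ..., "val": ...}` layout of each fold is an assumption rather than the format returned by the library.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=42):
    """Split sample indices into folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    val_sets = [[] for _ in range(n_folds)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin over the folds.
        for position, idx in enumerate(indices):
            val_sets[position % n_folds].append(idx)
    all_indices = set(range(len(labels)))
    folds = []
    for val in val_sets:
        # Everything outside the validation indices is used for training.
        folds.append({"train": sorted(all_indices - set(val)), "val": sorted(val)})
    return folds


labels = [1] * 10 + [0] * 10
folds = stratified_folds(labels, n_folds=5)
```

Each sample appears in exactly one validation set, and every validation set contains the same number of positives and negatives as far as the class sizes allow.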
