AutoPeptideML
Main class for handling the automatic development of bioactive peptide ML predictors.
Initialize an instance of the AutoPeptideML class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
verbose | bool | Whether to output information, defaults to True | True |
threads | int | Number of threads used to parallelise processes, defaults to cpu_count() | cpu_count() |
seed | int | Pseudo-random number generator seed. Important for reproducibility, defaults to 42 | 42 |
autosearch_negatives(df_pos, positive_tags, proportion=1.0, target_db=None)
Method for searching bioactive peptide databases for negative peptides to combine with the positives.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_pos | DataFrame | DataFrame with positive peptides. | required |
positive_tags | List[str] | List of names of bioactivities that may overlap with the target bioactivities. | required |
proportion | float | Negative:Positive ratio in the new dataset. Defaults to 1.0. | 1.0 |
target_db | Optional[str] | Path to CSV containing a database with columns | None |
Returns:
Type | Description |
---|---|
pd.DataFrame | New dataset with both positive and negative peptides. |
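
The way `positive_tags` and `proportion` interact can be illustrated with a small standard-library sketch. It is not the library's implementation: the `(sequence, tag)` database layout, the function name `sample_negatives`, and the label values are assumptions made for illustration only.

```python
import random


def sample_negatives(positives, database, positive_tags, proportion=1.0, seed=42):
    """Draw negatives from `database`, skipping overlapping bioactivities.

    `database` is a list of (sequence, tag) pairs (hypothetical layout).
    """
    excluded_tags = set(positive_tags)
    # Any sequence annotated with an excluded tag is ruled out entirely.
    overlapping = {seq for seq, tag in database if tag in excluded_tags}
    candidates = sorted(
        {seq for seq, _ in database} - overlapping - set(positives)
    )
    # Number of negatives implied by the Negative:Positive ratio.
    n_negatives = min(round(proportion * len(positives)), len(candidates))
    negatives = random.Random(seed).sample(candidates, n_negatives)
    # Label positives 1 and negatives 0 (assumed convention).
    return [(seq, 1) for seq in positives] + [(seq, 0) for seq in negatives]


positives = ["GLFDIVKKV", "KWKLFKKIG"]
database = [
    ("GLFDIVKKV", "antimicrobial"),
    ("AAPPGG", "antiviral"),
    ("DRVYIHPF", "antihypertensive"),
    ("YGGFL", "neuropeptide"),
    ("RPPGFSPFR", "antihypertensive"),
]
dataset = sample_negatives(
    positives, database, positive_tags=["antimicrobial", "antiviral"]
)
```

With `proportion=1.0` the sketch returns as many negatives as positives, and no negative carries a bioactivity listed in `positive_tags`.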
balance_samples(df)
Oversample the underrepresented class in the DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | DataFrame with positive and negative peptides to be balanced. | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | DataFrame with balanced number of positive and negative peptides. |
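
Oversampling the underrepresented class can be sketched in plain Python as random sampling with replacement. This is a minimal illustration, not the method's actual code: rows are dictionaries rather than DataFrame rows, and the label key `Y` and the helper name `oversample_minority` are assumptions.

```python
import random


def oversample_minority(rows, label_key="Y", seed=42):
    """Duplicate random minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        raise ValueError("Both classes need at least one sample.")
    # Sample with replacement to fill the gap between the two classes.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


rows = [
    {"sequence": "GLFDIVKKV", "Y": 1},
    {"sequence": "DRVYIHPF", "Y": 0},
    {"sequence": "YGGFL", "Y": 0},
    {"sequence": "RPPGFSPFR", "Y": 0},
]
balanced = oversample_minority(rows)
```

Because the extra rows are copies of existing minority samples, balancing of this kind is usually applied to training data only, so that duplicates do not leak into evaluation sets.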
compute_representations(datasets, re)
Use a Protein Representation Model, loaded with the RepresentationEngine class, to compute representations for the peptides in the datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Dict[str, DataFrame] | Dictionary with the dataset partitions as DataFrames. Output from the method | required |
re | RepresentationEngine | Class with a Protein Representation Model. | required |
Returns:
Type | Description |
---|---|
dict | Dictionary with pd.DataFrame |
curate_dataset(dataset, outputdir=None)
Load a DataFrame or use one already loaded and then remove all entries with non-canonical residues or repeated sequences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Union[str, DataFrame] | Dataset or path to dataset. | required |
outputdir | str | Path where the curated dataset is saved, defaults to None | None |
Returns:
Type | Description |
---|---|
pd.DataFrame | Curated dataset. |
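
The two curation rules (canonical residues only, no repeated sequences) can be sketched as follows. This is an illustrative stand-in working on a list of strings; the uppercase normalisation and the helper name `curate` are assumptions, not documented behaviour of the method.

```python
# The 20 canonical amino acids in one-letter code.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def curate(sequences):
    """Drop sequences with non-canonical residues and repeated sequences."""
    seen = set()
    curated = []
    for seq in sequences:
        seq = seq.strip().upper()  # normalisation step assumed for this sketch
        if not seq or not set(seq) <= CANONICAL:
            continue  # contains a residue outside the canonical alphabet
        if seq in seen:
            continue  # repeated sequence, keep only the first occurrence
        seen.add(seq)
        curated.append(seq)
    return curated


raw = ["ACDK", "acdk", "AXBK", "GLFD", "ACDK", "GLU1"]
clean = curate(raw)
```

Here `AXBK` and `GLU1` are rejected for containing symbols outside the canonical alphabet, and the later copies of `ACDK` are dropped as duplicates.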
evaluate_model(best_model, test_df, id2rep, outputdir)
Evaluate an ensemble model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
best_model | list | List of models with a | required |
test_df | DataFrame | Evaluation dataset with | required |
id2rep | dict | Dictionary with keys being the | required |
outputdir | str | Path where the evaluation data is saved. | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | Dataset with the evaluation metrics. |
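
The source does not list which metrics are reported, so the following sketch only shows how common binary-classification metrics could be derived from ensemble scores. The choice of metrics, the 0.5 decision threshold, and the function name `binary_metrics` are all assumptions for illustration.

```python
import math


def binary_metrics(y_true, scores, threshold=0.5):
    """Compute a few standard metrics from true labels and predicted scores."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


y_true = [1, 1, 0, 0, 1, 0]
# Scores assumed to be already averaged over the ensemble members.
scores = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]
metrics = binary_metrics(y_true, scores)
```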
hpo_train(config, train_df, id2rep, folds, outputdir, n_jobs=1)
Hyperparameter Optimisation and training.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | dict | Dictionary with hyperparameter search space. | required |
train_df | DataFrame | Training dataset with | required |
id2rep | dict | Dictionary with pd.DataFrame | required |
folds | list | List with the training/validation folds | required |
outputdir | str | Path to the directory where information should be saved. | required |
n_jobs | int | Number of threads to parallelise the training, defaults to 1. | 1 |
Returns:
Type | Description |
---|---|
list | List with the models that comprise the final ensemble. |
predict(df, re, ensemble_path, outputdir, df_repr=None, backend='onnx')
Predicts scores and uncertainties for input sequences using a pre-trained ensemble of models.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | DataFrame containing sequences to be predicted. Must include a 'sequence' column. | required |
re | RepresentationEngine | Representation engine for computing feature representations of sequences. | required |
ensemble_path | str | Path to the directory containing the ensemble of trained models. | required |
outputdir | str | Directory where prediction results will be saved. Created if it does not exist. | required |
df_repr | list | Precomputed representations of the input sequences. If None, representations are computed. | None |
backend | str | Backend used for prediction. Supported values are 'onnx' (default) and 'joblib'. | 'onnx' |
Returns:
Type | Description |
---|---|
pd.DataFrame | DataFrame with predictions, including 'score' (average prediction) and 'score_uncertainty' (standard deviation). |
Raises:
Type | Description |
---|---|
ImportError | If required libraries for the selected backend are not installed. |
NotImplementedError | If an unsupported backend is specified. |
Notes:
- Converts joblib models to ONNX format if
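
The 'score' and 'score_uncertainty' columns described above correspond to the mean and standard deviation of the ensemble members' predictions. A minimal sketch follows, under several assumptions: models are plain callables, representations are single floats, and the population standard deviation is used (the source does not say which variant the library computes).

```python
import statistics


def ensemble_predict(models, representations):
    """Return mean score and spread across ensemble members for each input."""
    results = []
    for rep in representations:
        member_scores = [model(rep) for model in models]
        results.append({
            "score": statistics.mean(member_scores),
            # Population standard deviation, chosen here as an assumption.
            "score_uncertainty": statistics.pstdev(member_scores),
        })
    return results


# Toy stand-ins for trained models: each maps a representation to a score.
models = [lambda x: 0.2 * x, lambda x: 0.4 * x, lambda x: 0.6 * x]
predictions = ensemble_predict(models, [1.0, 0.5])
```

A large `score_uncertainty` means the ensemble members disagree on that sequence, which is a reason to treat its `score` with more caution.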
train_test_partition(df, threshold=0.3, test_size=0.2, denominator='n_aligned', alignment=None, outputdir='./splits')
Novel homology partitioning algorithm for generating independent hold-out evaluation sets.
This method partitions the provided dataset into training and testing sets based on sequence similarity. It ensures that sequences in the training and testing sets do not exceed a specified sequence identity threshold, resulting in distinct datasets for evaluation.
Example:
    data = pd.DataFrame({'id': [...], 'sequence': [...], 'Y': [...]})
    partitioned_data = train_test_partition(
        df=data,
        threshold=0.4,
        test_size=0.25,
        denominator='shortest',
        alignment='needle',
        outputdir='./data_splits'
    )
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Dataset to partition with the following columns: | required |
threshold | float | Maximum sequence identity allowed between sequences in training and evaluation sets. Pairs of sequences whose identity exceeds this threshold are never split across the two sets. Defaults to 0.3. | 0.3 |
test_size | float | Proportion of samples in evaluation (test) set. A float between 0 and 1, where 0.2 means 20% of the dataset will be allocated to the test set. Defaults to 0.2. | 0.2 |
denominator | str | Denominator used to calculate sequence identity between pairs of sequences. Options include: - | 'n_aligned' |
alignment | str | Sequence alignment method to compute similarity. Options include: - | None |
outputdir | str | Directory where the resulting train and test CSV files will be saved. | './splits' |
Returns:
Type | Description |
---|---|
Dict[str, pd.DataFrame] | A dictionary containing the training and testing DataFrames: - |
Raises:
Type | Description |
---|---|
FileNotFoundError | If the output directory cannot be created or accessed. |
ValueError | If an unsupported alignment method is specified. |
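
The guarantee that no train/test pair exceeds the identity threshold can be illustrated with a deliberately simplified scheme: cluster sequences into connected components of the "identity above threshold" graph, then assign whole clusters to the test set. This is a swapped-in technique for illustration, not the library's partitioning algorithm, and the ungapped identity measure below is a toy stand-in for a real alignment.

```python
def identity(a, b):
    """Toy ungapped identity: matching positions over the shorter length."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / shortest


def similarity_clusters(seqs, threshold):
    """Connected components of the graph linking pairs above `threshold`."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) > threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def partition(seqs, threshold=0.3, test_size=0.2):
    """Fill the test set with whole clusters, smallest first."""
    target = round(test_size * len(seqs))
    test = []
    for cluster in sorted(similarity_clusters(seqs, threshold), key=len):
        if len(test) + len(cluster) <= target:
            test.extend(cluster)  # clusters are never split across sets
    test_idx = set(test)
    train = [s for i, s in enumerate(seqs) if i not in test_idx]
    return train, [seqs[i] for i in test]


seqs = ["AAAAKK", "AAAAKR", "AAAAGG", "WWYYFF", "WWYYFL",
        "DDEEDD", "PLPLPL", "GHGHGH", "CMCMCM", "TSTSTS"]
train, test = partition(seqs, threshold=0.3, test_size=0.2)
```

Because any two sequences with identity above the threshold end up in the same cluster, and clusters are assigned as a unit, every train/test pair in this sketch stays at or below the threshold.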
train_val_partition(df, method='random', threshold=0.5, alignment='peptides', denominator='n_aligned', n_folds=10, outputdir='./folds')
Method for generating n training/validation folds for cross-validation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Training dataset with | required |
method | str | Method for generating the folds. Options available: | 'random' |
threshold | float | If mode is | 0.5 |
denominator | str | Denominator to calculate sequence identity. Options: - | 'n_aligned' |
alignment | str | If mode is | 'peptides' |
n_folds | int | Number of training/validation folds to generate, defaults to 10 | 10 |
outputdir | str | Path where data should be saved, defaults to './folds' | './folds' |
|
Returns:
Type | Description |
---|---|
list | List of training/validation folds |
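
For the 'random' method, fold generation amounts to a seeded shuffle followed by a split into `n_folds` validation parts. The sketch below works on sample indices; the `{"train": ..., "val": ...}` layout of each fold and the helper name `random_folds` are assumptions, since the source does not describe the structure of the returned list.

```python
import random


def random_folds(n_samples, n_folds=10, seed=42):
    """Build n_folds train/validation index splits from a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = []
    for k in range(n_folds):
        # Deal shuffled indices round-robin so fold sizes differ by at most one.
        val = indices[k::n_folds]
        val_set = set(val)
        train = [i for i in indices if i not in val_set]
        folds.append({"train": sorted(train), "val": sorted(val)})
    return folds


folds = random_folds(25, n_folds=5)
```

Each sample appears in exactly one validation part, and fixing the seed makes the folds reproducible across runs.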