qbiocode.learning.compute_dt module

Summary

Functions:

compute_dt

This function generates a model using a Decision Tree (dt) Classifier method as implemented in scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

compute_dt_opt

This function also generates a model using a Decision Tree (dt) Classifier method as implemented in scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.tree.DecisionTreeClassifier.html), but selects its hyperparameters with a grid search.

Reference

compute_dt(X_train, X_test, y_train, y_test, args, verbose=False, model='Decision Tree', data_key='', criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0, monotonic_cst=None)

This function generates a model using a Decision Tree (dt) Classifier method as implemented in scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.tree.DecisionTreeClassifier.html). It accepts the parameter values specified in the config.yaml file and falls back to the defaults shown in the signature above for any that are not passed. The model is trained on the training dataset and validated on the test dataset. The function returns the evaluation of the model on the test dataset, including accuracy, AUC, F1 score, and the time taken to train and validate the model. It is intended for supervised learning, where the goal is to classify data points.
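The train-then-evaluate flow described above can be sketched directly against scikit-learn. This is a minimal illustration, not the qbiocode implementation: the helper name `evaluate_decision_tree`, the keys of the returned dictionary, and the synthetic dataset are all assumptions made for the example.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_decision_tree(X_train, X_test, y_train, y_test, **tree_params):
    """Hypothetical helper: fit a decision tree and score it on the test split."""
    start = time.perf_counter()
    clf = DecisionTreeClassifier(**tree_params)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # For binary labels, AUC is computed from the positive-class probability.
    y_score = clf.predict_proba(X_test)[:, 1]
    elapsed = time.perf_counter() - start
    # The key names below are illustrative, not the keys qbiocode returns.
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
        "time_s": elapsed,
        "params": clf.get_params(),
    }


# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

result = evaluate_decision_tree(
    X_train, X_test, y_train, y_test, criterion="gini", random_state=0
)
print(result["accuracy"], result["f1"], result["auc"], result["time_s"])
```

Timing the fit and the prediction together mirrors the "time taken to train and validate" wording; whether the library times them jointly or separately is not stated in this page.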

Parameters:
  • X_train (array-like) – Training data features.

  • X_test (array-like) – Test data features.

  • y_train (array-like) – Training data labels.

  • y_test (array-like) – Test data labels.

  • args (dict) – Additional arguments, typically from config.yaml.

  • verbose (bool) – If True, prints additional information during execution.

  • model (str) – Name of the model being used. Default is ‘Decision Tree’.

  • data_key (str) – Key for the dataset, if applicable.

  • criterion (str) – The function to measure the quality of a split. Default is ‘gini’.

  • splitter (str) – The strategy used to choose the split at each node. Default is ‘best’.

  • max_depth (int or None) – The maximum depth of the tree. Default is None.

  • min_samples_split (int) – The minimum number of samples required to split an internal node. Default is 2.

  • min_samples_leaf (int) – The minimum number of samples required to be at a leaf node. Default is 1.

  • min_weight_fraction_leaf (float) – The minimum weighted fraction of the sum total of weights required to be at a leaf node. Default is 0.0.

  • max_features (int, float, str or None) – The number of features to consider when looking for the best split. Default is None.

  • random_state (int or None) – Controls the randomness of the estimator. Default is None.

  • max_leaf_nodes (int or None) – Grow a tree with max_leaf_nodes in best-first fashion. Default is None.

  • min_impurity_decrease (float) – A node will be split if this split induces a decrease of the impurity greater than or equal to this value. Default is 0.0.

  • class_weight (dict or 'balanced' or None) – Weights associated with classes in the form {class_label: weight}. Default is None.

  • ccp_alpha (float) – Complexity parameter used for Minimal Cost-Complexity Pruning. Default is 0.0.

  • monotonic_cst (array-like of int or None) – Monotonicity constraint to enforce on each feature (1 for increasing, -1 for decreasing, 0 for no constraint). Default is None.

Returns:

A dictionary containing the evaluation metrics, model parameters, and time taken for training and validation.

Return type:

modeleval (dict)
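Several of the parameters listed above share their names and defaults with scikit-learn's DecisionTreeClassifier. The sketch below uses that class directly, on a synthetic dataset, to show how two of them (max_depth and ccp_alpha) constrain tree size. It assumes compute_dt passes these values through to the estimator unchanged, which this page does not confirm; the value 0.02 for ccp_alpha is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Default settings: no depth limit and no pruning.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# max_depth caps how many levels the tree may grow.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# ccp_alpha > 0 applies minimal cost-complexity pruning after growing the tree.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

for name, tree in [("full", full_tree), ("shallow", shallow_tree), ("pruned", pruned_tree)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```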

compute_dt_opt(X_train, X_test, y_train, y_test, args, verbose=False, model='Decision Tree', cv=5, criterion=[], max_depth=[], min_samples_split=[], min_samples_leaf=[], max_features=[])

This function also generates a model using a Decision Tree (dt) Classifier method as implemented in scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.tree.DecisionTreeClassifier.html). The difference is that it runs a grid search with cross-validation over the hyperparameters. The range searched for each parameter is specified in the config.yaml file. The combination of parameters that gives the best performance is saved and returned as best_params, which can then be reused on similar datasets without rerunning the grid search. The model is trained on the training dataset and validated on the test dataset. The function returns the evaluation of the model on the test dataset, including accuracy, AUC, F1 score, and the time taken to train and validate the model across the grid search. It is intended for supervised learning, where the goal is to classify data points.
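A grid search of this kind can be sketched with scikit-learn's GridSearchCV. Using GridSearchCV is an assumption here, since this page only says "grid search" and exposes a cv argument. The search ranges and the synthetic dataset are likewise illustrative and are not taken from any config.yaml.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative search ranges, one list per tunable parameter.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 3],
    "max_features": [None, "sqrt"],
}

start = time.perf_counter()
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)  # cross-validates every combination, then refits the best
best_params = search.best_params_

# Score the refitted best estimator on the held-out test split.
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]
elapsed = time.perf_counter() - start

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)
print(best_params)
print(accuracy, f1, auc, elapsed)
```

The cross-validation folds are drawn from the training split only, so the test split stays unseen until the final scoring step.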

Parameters:
  • X_train (array-like) – Training data features.

  • X_test (array-like) – Test data features.

  • y_train (array-like) – Training data labels.

  • y_test (array-like) – Test data labels.

  • args (dict) – Additional arguments, typically from config.yaml.

  • verbose (bool) – If True, prints additional information during execution.

  • model (str) – Name of the model being used. Default is ‘Decision Tree’.

  • cv (int) – Number of cross-validation folds. Default is 5.

  • criterion (list) – List of split-quality criteria to search over. Default is an empty list.

  • max_depth (list) – List of maximum tree depths to search over. Default is an empty list.

  • min_samples_split (list) – List of values for the minimum number of samples required to split an internal node. Default is an empty list.

  • min_samples_leaf (list) – List of values for the minimum number of samples required to be at a leaf node. Default is an empty list.

  • max_features (list) – List of values for the number of features to consider when looking for the best split. Default is an empty list.

Returns:

A dictionary containing the evaluation metrics, best parameters, and time taken for training and validation.

Return type:

modeleval (dict)
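The description notes that best_params can be reused on similar datasets without rerunning the grid search. A minimal sketch of that reuse follows. The values in `best_params` are hypothetical stand-ins for the result of an earlier search, and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical values standing in for the best_params of a previous grid search.
best_params = {
    "criterion": "entropy",
    "max_depth": 4,
    "min_samples_split": 5,
    "min_samples_leaf": 3,
    "max_features": None,
}

# A new, similar dataset (synthetic here, for illustration only).
X_new, y_new = make_classification(n_samples=200, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_new, y_new, test_size=0.25, random_state=1, stratify=y_new
)

# Fit once with the stored parameters; no grid search is run.
clf = DecisionTreeClassifier(random_state=0, **best_params).fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```

Because the parameter names match the DecisionTreeClassifier keyword arguments, the dictionary can be unpacked directly into the constructor.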