Welcome to sbr’s documentation!
sbr is a set of useful functions and classes for modelling gene expression data with tensorflow.
Full Reference
- sbr.compile.one_layer_multicategorical(input_size=None, output_size=None, learning_rate: float = 0.0001, dim: int = 1000, specificityAtSensitivityThreshold: float = 0.5, sensitivityAtSpecificityThreshold: float = 0.5, kernel_initializer=tensorflow.keras.initializers.HeNormal, bias_initializer=tensorflow.zeros_initializer, output_activation: str = 'softmax', isMultilabel: bool = True, seed=None, verbose: bool = True)
Compile a single layer multicategorical model.
Use sbr.visualize.plot_loss_curve to view the metrics after fitting.
- Parameters
input_size – Usually x_train.shape[1]; not required to compile the model, but required for calling model.summary()
output_size – Number of classes in the one-hot-encoded target vector; usually y_train.shape[1]
learning_rate – Initial learning rate. Plan for this to be reduced at EarlyStopping checkpoints during model training/fit
dim – Number of nodes in the hidden layer. Something half-way between input_size and output_size is a good choice, but if input_size is very big, the number may need to be smaller in order to reduce the number of trainable parameters and avoid over-fitting.
specificityAtSensitivityThreshold – With at least this sensitivity (the fraction of true positives detected, e.g., 0.5), report the specificity (the fraction of true negatives correctly rejected). This is a bit trickier for multivariate problems; see this blog article on analyticsvidhya.com
sensitivityAtSpecificityThreshold – The converse: with at least this specificity, report the sensitivity.
kernel_initializer – HeNormal initializer forces diversity of outcomes between trainings
bias_initializer – initializer for the biases
output_activation – Use softmax for multicategorical, one-hot encoded
isMultilabel – Should always be True for multicategorical models
seed – Used with tensorflow, numpy, and python so that their random number generators produce reproducible results. Also use ‘seed’ in the initializer.
verbose – If True, print model summary. Set to False if input_size is None to avoid an error
- Returns
model of type tf.keras.Model
Example usage:
>>> model = compile.one_layer_multicategorical(
...     input_size=x_train.shape[1],
...     output_size=y_train.shape[1],
...     output_activation='softmax',
...     learning_rate=0.0001,
...     isMultilabel=True,
...     dim=1000,
...     specificityAtSensitivityThreshold=0.50,
...     sensitivityAtSpecificityThreshold=0.50,
...     seed=42,
...     verbose=True)
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
Input_BAD (BADBlock)         (None, 1000)              18968000
_________________________________________________________________
output (Dense)               (None, 26)                26026
=================================================================
Total params: 18,994,026
Trainable params: 18,992,026
Non-trainable params: 2,000
_________________________________________________________________
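The two threshold parameters can be easier to read with a small worked example. The sketch below is not sbr or TensorFlow code: specificity_at_sensitivity is a hypothetical helper for the binary case that tries every observed score as a decision threshold and keeps the best specificity among thresholds that still reach the requested sensitivity. The Keras metric of the same name approximates this over a fixed number of thresholds.

```python
import numpy as np


def specificity_at_sensitivity(y_true, y_score, min_sensitivity=0.5):
    """Best specificity among thresholds whose sensitivity >= min_sensitivity.

    Hypothetical helper for illustration; assumes binary labels with at
    least one positive and one negative sample.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    best = 0.0
    for threshold in np.unique(y_score):
        predicted = y_score >= threshold
        tp = np.sum(predicted & y_true)
        fn = np.sum(~predicted & y_true)
        tn = np.sum(~predicted & ~y_true)
        fp = np.sum(predicted & ~y_true)
        sensitivity = tp / (tp + fn)  # true positive rate
        specificity = tn / (tn + fp)  # true negative rate
        if sensitivity >= min_sensitivity:
            best = max(best, float(specificity))
    return best


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
loose = specificity_at_sensitivity(labels, scores, min_sensitivity=0.5)
strict = specificity_at_sensitivity(labels, scores, min_sensitivity=1.0)
print(loose, strict)
```

Demanding a higher sensitivity forces a lower threshold, which is why the specificity drops in the second call.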
- class sbr.layers.BADBlock(*args: Any, **kwargs: Any)
Inherits from keras.layers.Dense. A Dense layer followed by Batch normalization, Activation, and Dropout (hence BAD). When the popular kwarg input_shape is passed, a Keras input layer is created and inserted before the current layer, so an InputLayer does not need to be defined explicitly.
This is a very good layer to use for gene expression data to increase stability and reduce trainable parameters.
Example 1:
Recreate this layer from its config:
>>> layer = BADBlock(units=1000)
>>> config = layer.get_config()
>>> new_layer = BADBlock.from_config(config)
Example 2:
Use in a model:
>>> import tensorflow as tf
>>> from tensorflow.keras.layers import Dense
>>> from tensorflow.keras.models import Sequential
>>> from sbr.layers import BADBlock
>>> model = Sequential()
>>> model.add(BADBlock(units=1000, input_dim=18963, activation='relu', dropout_rate=0.50, name="BAD_1"))
>>> model.add(Dense(26, activation="softmax"))
>>> model.summary()
>>> model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy', 'mse'])
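The parameter counts for a model of this shape can be checked by hand. The arithmetic below assumes that a BADBlock holds one Dense kernel and bias plus one batch-normalization layer with four values per unit (two trainable, two non-trainable moving statistics). That layout is an assumption, although it reproduces the counts in the model summary shown for sbr.compile.one_layer_multicategorical. The helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Kernel weights plus one bias per unit."""
    return n_inputs * n_units + n_units


def batchnorm_params(n_units):
    """(trainable, non_trainable): gamma/beta and moving mean/variance."""
    return 2 * n_units, 2 * n_units


input_dim, units, n_classes = 18963, 1000, 26

bn_trainable, bn_frozen = batchnorm_params(units)
bad_block = dense_params(input_dim, units) + bn_trainable + bn_frozen
output_layer = dense_params(units, n_classes)

total = bad_block + output_layer
non_trainable = bn_frozen
trainable = total - non_trainable
print(bad_block, output_layer, total, trainable, non_trainable)
```

Almost all of the parameters sit in the first kernel (input_dim * units), which is why shrinking the hidden layer is the main lever for reducing trainable parameters.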
- sbr.evaluate.compare_predictions(y_test, y_pred=None, model=None, x_test=None, class_names=None, verbose=True)
Compares predictions with truth. If y_pred is None, predicts y_test from x_test using model.
- Parameters
y_test – targets
y_pred – values to compare to y_test; if None, model and x_test are used to create y_pred
model – the model whose model.predict is called to create y_pred when y_pred is None
x_test – the features passed to model to predict y_pred when y_pred is not given
class_names – an ordered list of class name strings that map to the np.argmax(y_test, axis=1) indices in y_test. If None, class indices are reported instead of string names.
verbose – if verbose, pairs are printed out (good if there aren’t a lot of mislabeled predictions)
- Returns
(y_pred, pairs)
y_pred: the predicted outcomes from x_test
pairs: list of (<truth>, <false-prediction>) class-name pairs
- Example usage:
>>> y_pred, pairs = compare_predictions(model=model, x_test=x_test, y_test=y_test, class_names=class_names, verbose = True)
Number of test samples: 256
Mis-classifications: (<truth>,<false-prediction>)
[('Esophagus', 'Blood Vessel'), ('Blood Vessel', 'Heart'), ('Adipose Tissue', 'Breast'), ('Salivary Gland', 'Esophagus')]
[sbr.model.save_architecture] Model successfully saved at: data/model/gtex/manual/gtex_model.h5.
Model: "sequential"
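The pairing logic described above can be sketched with NumPy alone. This is a minimal illustration, not the sbr implementation: misclassified_pairs is a hypothetical name, and the sketch assumes y_test and y_pred are arrays with one column per class.

```python
import numpy as np


def misclassified_pairs(y_test, y_pred, class_names=None):
    """Return (truth, false-prediction) pairs for rows whose argmax differs."""
    truth = np.argmax(y_test, axis=1)
    guess = np.argmax(y_pred, axis=1)
    pairs = []
    for t, g in zip(truth, guess):
        if t == g:
            continue  # correctly classified, nothing to report
        if class_names is None:
            pairs.append((int(t), int(g)))  # fall back to class indices
        else:
            pairs.append((class_names[t], class_names[g]))
    return pairs


y_true_demo = np.eye(3)[[0, 1, 2]]
y_pred_demo = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6],
                        [0.2, 0.1, 0.7]])
named = misclassified_pairs(y_true_demo, y_pred_demo, ["Lung", "Heart", "Breast"])
indexed = misclassified_pairs(y_true_demo, y_pred_demo)
print(named, indexed)
```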
- sbr.evaluate.mislabeled_pair_counts(model, X, y, class_names, sample_ids=None, batch_size=1500, verbose=False)
For multicategorical models: creates a table of observed, predicted class names for the mispredicted observations. This tends to use a lot of memory on multiple runs in a jupyter notebook, with tensorflow 2.6. May need to restart the kernel on second run. If resources continue to be a problem after restarting the kernel, reduce the batch_size.
Assumes y_pred, y_obs are one-hot encoded and class_names matches the index predictions returned from np.argmax(y_pred)
- Parameters
model – used for model.predict
X – feature values
y – one-hot encoded true labels
class_names – ordered list of class_names
sample_ids – pass in this Series object to get back a table of pairs with their sample_ids
batch_size – number of samples to process in each step (to keep from swamping memory)
verbose – helps with debugging; messages each step/batch
- Returns
(pairs_counts, pair_id_map)
pairs_counts: Table with compound index ‘observed’, ‘predicted’ and one column, ‘counts’, with the count of all the samples in that observed/predicted mislabeled pair.
pair_id_map: None if sample_ids wasn’t passed in, otherwise returns a table with columns observed, predicted, sample_id
- Example Usage: Get the mislabeled counts
>>> mislabeled_counts, mislabeled = mislabeled_pair_counts(
...     model=model, X=X, y=y, class_names=class_names,
...     sample_ids=pd.Series(label_df["sample_id"]),
...     batch_size=500)
>>> mislabeled_counts
- Example Usage: Get the mislabeled samples
>>> m = mislabeled.reset_index()
>>> m[m['observed'] == "Lung"]
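The role of batch_size can be illustrated with a sketch that predicts one slice at a time, so that only one batch of predictions is held in memory. This is an assumption about the general approach, not the sbr code: batched_pair_counts is a hypothetical helper, predict_fn stands in for model.predict, and a Counter replaces the returned table.

```python
from collections import Counter

import numpy as np


def batched_pair_counts(predict_fn, X, y, class_names, batch_size=1500):
    """Count (observed, predicted) class-name pairs for mispredicted rows."""
    counts = Counter()
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        observed = np.argmax(y[start:stop], axis=1)
        predicted = np.argmax(predict_fn(X[start:stop]), axis=1)
        for o, p in zip(observed, predicted):
            if o != p:
                counts[(class_names[o], class_names[p])] += 1
    return counts


# Toy stand-in for a model: reversing the columns swaps classes 0 and 2.
def fake_predict(batch):
    return batch[:, ::-1]


X_demo = np.eye(3)[[0, 1, 2, 0, 1]]
y_demo = X_demo.copy()
pair_counts = batched_pair_counts(fake_predict, X_demo, y_demo,
                                  ["A", "B", "C"], batch_size=2)
print(pair_counts)
```

A smaller batch_size lowers peak memory at the cost of more predict calls; the counts do not depend on it.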
- sbr.evaluate.training_report(model, x_test, y_test, sensitivityAtSpecificityThreshold=None, specificityAtSensitivityThreshold=None, verbose=True)
Calls model.evaluate(x_test,y_test) and, if verbose, reports on the performance, then returns a performance object like the one returned by model.evaluate.
- Parameters
x_test – features
y_test – targets
verbose – if True, report to stdout
sensitivityAtSpecificityThreshold – If not None, and verbose, and this metric was captured in model.fit, report it to stdout
specificityAtSensitivityThreshold – see above
- Returns
A performance object
- Example Usage:
>>> performance = training_report(model, x_test, y_test, sensitivityAtSpecificityThreshold=sensitivityAtSpecificityThreshold, specificityAtSensitivityThreshold=specificityAtSensitivityThreshold, verbose=True)
Performance:
Performance details:
loss:0.07804308831691742
accuracy:0.984375
mse:0.0009617832256481051
precision:0.984375
recall:0.984375
auc:0.9988833665847778
SpecificityAtSensitivity:0.9998437762260437
SensitivityAtSpecificity:0.99609375
fp:4.0
fn:4.0
tp:252.0
tn:6396.0
Figure(500x500)
Number of training samples: 2080
Number of validation samples: 256
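Several of the reported values follow directly from the four confusion counts. The sketch below assumes the counts are pooled over every (sample, class) cell, which fits the numbers: 256 samples times 26 classes gives the 6656 cells that tp, fp, fn, and tn add up to. rates_from_counts is a hypothetical helper, not part of sbr.

```python
def rates_from_counts(tp, fp, fn, tn):
    """Derive the usual rates from pooled confusion counts."""
    return {
        "precision": tp / (tp + fp),    # share of positive calls that are right
        "recall": tp / (tp + fn),       # sensitivity / true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "cells": tp + fp + fn + tn,
    }


rates = rates_from_counts(tp=252.0, fp=4.0, fn=4.0, tn=6396.0)
print(rates)
```

The specificity computed here is taken at the default decision threshold, so it need not equal the reported SpecificityAtSensitivity, which is measured at the threshold that meets the requested sensitivity.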
…
- sbr.fit.multicategorical_model(model, model_folder, x_train, y_train, x_validation, y_validation, epochs=200, patience=4, lr_patience=2, lr_factor=0.1, batch_size=32, shuffle_value=100, seed=None, initial_epoch=0, train_verbose=1, checkpoint_verbose=1)
Fits the given model with the given hyperparameters and multi-categorical data, after computing class weights and shuffling the data. Writes checkpoint and final model weights to model_folder. Look under variables/variables.* for weights.
Assumptions
Model has been compiled and saved to f"{model_path}.h5" (e.g., data/model/gtex/manual/gtex_model.h5)
Targets are one-hot encoded
Features have been normalized
Tested with tensorflow v2.6.2, keras 2.6.0
- Parameters
model – a compiled model
model_folder – writable folder to store the checkpoint and final model weights
x_train – training features, see sbr.split for help
y_train – training targets, see above
x_validation – validation features, see above
y_validation – validation targets, see above
epochs [200] – Number of epochs to train
patience [4] – Number of epochs with no improvement after which training will be stopped.
lr_patience [2] – Number of epochs with no improvement after which learning rate will be reduced.
lr_factor [0.1] – Factor by which the learning rate will be reduced. new_lr = lr * factor.
batch_size [32] – probably don’t change this
shuffle_value [100]
seed [None] – seed the random number generator for reproducibility
initial_epoch [0] – use this if you want to resume training at a particular epoch
train_verbose [1] – amount of information to print on each epoch. 0: silent, 1: animated progress bar, 2: mentions epoch. For example:
0: <silent>
1:
[==================]
Epoch 00015: val_loss improved from 0.06645 to 0.06611, saving model to data/model/gtex
INFO:tensorflow:Assets written to: data/model/gtex/assets
2:
Epoch 1/10
checkpoint_verbose [1] – amount of information to print on each epoch about the checkpoint. 0: silent.
- Returns
history
A History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable). Use print(history.history.keys()) to list all recorded metrics and print(history.history['val_loss']) to print the validation loss.
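How patience, lr_patience, and lr_factor interact can be seen in a pure-Python simulation. This is a simplified sketch of plateau-based learning-rate reduction and early stopping driven by a list of validation losses. It is not the Keras callback code, it ignores options such as min_delta and cooldown, and simulate_plateau_schedule is a hypothetical name.

```python
def simulate_plateau_schedule(val_losses, lr=1e-4, patience=4,
                              lr_patience=2, lr_factor=0.1):
    """Return (stopped_epoch or None, learning rate used in each epoch)."""
    best = float("inf")
    epochs_without_gain = 0  # drives early stopping
    epochs_since_lr_cut = 0  # drives learning-rate reduction
    lrs = []
    for epoch, loss in enumerate(val_losses):
        lrs.append(lr)  # rate in effect while this epoch trained
        if loss < best:
            best = loss
            epochs_without_gain = 0
            epochs_since_lr_cut = 0
            continue
        epochs_without_gain += 1
        epochs_since_lr_cut += 1
        if epochs_since_lr_cut >= lr_patience:
            lr = lr * lr_factor  # new_lr = lr * factor
            epochs_since_lr_cut = 0
        if epochs_without_gain >= patience:
            return epoch, lrs
    return None, lrs


losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.5]
stopped_epoch, lr_history = simulate_plateau_schedule(losses)
print(stopped_epoch, lr_history)
```

With lr_patience smaller than patience, the learning rate is cut at least once before training stops, which gives the optimizer a chance to recover at a finer step size.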
- Example Usage:
>>> from sbr import fit
>>> history = fit.multicategorical_model(
...     model=model,
...     model_folder='data/model/gtex',
...     x_train=x_train, y_train=y_train,
...     x_validation=x_validation, y_validation=y_validation,
...     epochs=200, patience=4, lr_patience=2,
...     checkpoint_verbose=1, train_verbose=0)
- Example Usage: Reload with:
>>> from tensorflow.keras.models import load_model
>>> model = load_model(f"{model_path}")
>>> model.load_weights(f"{model_folder}")
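The description mentions that class weights are computed before fitting but does not say how. One common choice, shown here purely as an assumption, is the "balanced" heuristic n_samples / (n_classes * class_count), which gives rare classes proportionally larger weights. balanced_class_weights is a hypothetical helper; the resulting dictionary has the shape Keras accepts for the class_weight argument of model.fit.

```python
import numpy as np


def balanced_class_weights(y_onehot):
    """Map class index -> weight, inversely proportional to class frequency."""
    y_onehot = np.asarray(y_onehot)
    n_samples, n_classes = y_onehot.shape
    counts = y_onehot.sum(axis=0)
    # Classes with no samples are skipped to avoid dividing by zero.
    return {
        index: float(n_samples / (n_classes * count))
        for index, count in enumerate(counts)
        if count > 0
    }


y_demo_weights = np.eye(2)[[0, 0, 0, 1]]  # three samples of class 0, one of class 1
weights = balanced_class_weights(y_demo_weights)
print(weights)
```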
- sbr.model.save_architecture(model, model_path: Optional[str] = None, file_name='model.h5', input_size=None, verbose=1)
Saves the given model to the given path and name. If possible, train and then run this in a notebook so the trained model stays resident in memory; that way this function can be tried again if it fails for some reason.
Note
Custom layer BADBlock will be loaded as part of the configuration.
Warning
THIS WILL OVER-WRITE ANY EXISTING MODEL.
- Parameters
model – model object for calling model.save
model_path – file path where model is to be written
file_name – name of the file, h5 format. Any existing file will be overwritten.
input_size – if not None, attempts to check that the saved model’s predictions are close to the original model’s
verbose – 0: debug, 1: print out model summary. This may throw an error if the model wasn’t compiled with a known input size
- Returns
True on success, False otherwise. Check the return to try again if it fails while model is still resident in memory.
Example usage:
>>> success = sbr.model.save_architecture(model, model_path="data/model/manual", file_name="model.h5", verbose=1)
>>> success
True
- sbr.preprocessing.dataset.multicategorical_split(X, y, sample_count_threshold=100, test_fraction=0.1, validation_fraction=0.1, verbose=True, batch_size=32, seed=None, shuffle=True)
Shuffles and splits X, y into train, validation, and test sets, rounding each dataset size to a multiple of batch_size.
Final dataset size is approximately (sample_count_threshold * <number of classes>).
See also: sbr.preprocessing.gtex.dataset_setup
- Parameters
X – Features
y – multicategorical targets (more than one column)
sample_count_threshold – use about this many samples from each class
seed[None] – set this to make function deterministic/repeatable
shuffle[True] – probably don’t touch this. Shuffling the data really helps down-stream model training.
- Returns
(x_train, y_train, x_val, y_val, x_test, y_test)
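The general shape of such a split can be sketched with NumPy. This is an illustration under stated assumptions, not the sbr implementation: sketch_split is a hypothetical name, each class is capped at sample_count_threshold rows, the fractions are applied to the capped total, and the final rounding to batch_size is left out.

```python
import numpy as np


def sketch_split(X, y, sample_count_threshold=100, test_fraction=0.1,
                 validation_fraction=0.1, seed=None):
    """Cap each class, shuffle, then slice into train/validation/test."""
    rng = np.random.default_rng(seed)
    labels = np.argmax(y, axis=1)
    kept = []
    for label in np.unique(labels):
        rows = np.flatnonzero(labels == label)
        rng.shuffle(rows)
        kept.append(rows[:sample_count_threshold])  # at most this many per class
    kept = np.concatenate(kept)
    rng.shuffle(kept)  # mix the classes before slicing
    n_test = int(len(kept) * test_fraction)
    n_val = int(len(kept) * validation_fraction)
    test = kept[:n_test]
    val = kept[n_test:n_test + n_val]
    train = kept[n_test + n_val:]
    return X[train], y[train], X[val], y[val], X[test], y[test]


demo_labels = np.repeat([0, 1, 2], 10)
X_all = np.arange(60).reshape(30, 2)
y_all = np.eye(3)[demo_labels]
parts = sketch_split(X_all, y_all, sample_count_threshold=5,
                     test_fraction=0.2, validation_fraction=0.2, seed=0)
print([p.shape for p in parts])
```

Passing the same seed reproduces the same split, which matches the documented purpose of the seed parameter.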
- sbr.preprocessing.dataset.trim_list_size_to_batch_size_factor(batch_size=32, trim_list=None)
Trims each array in the given list of multicategorical arrays down to a multiple of the given batch_size. This can avoid errors during training when the dataset is very large, a small amount of data loss is acceptable, and retaining a specific batch_size (e.g., 32) is preferred.
- Parameters
trim_list – a list of arrays to be trimmed
batch_size[32] – probably leave this alone
- Returns
the same trim_list, but trimmed
- Example Usage:
>>> [x_train, y_train, x_val, y_val, x_test, y_test] = trim_list_size_to_batch_size_factor(trim_list=[x_train, y_train, x_val, y_val, x_test, y_test])
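The trimming itself amounts to dropping the remainder rows of each array. A minimal sketch, assuming the arrays support len() and slicing; trim_to_batch_multiple is a hypothetical stand-in, not the sbr function:

```python
import numpy as np


def trim_to_batch_multiple(arrays, batch_size=32):
    """Drop trailing rows so each array length is a multiple of batch_size."""
    trimmed = []
    for array in arrays:
        keep = len(array) - (len(array) % batch_size)
        trimmed.append(array[:keep])
    return trimmed


features = np.zeros((70, 4))
targets = np.zeros((70, 3))
features_trimmed, targets_trimmed = trim_to_batch_multiple([features, targets])
print(features_trimmed.shape, targets_trimmed.shape)
```

Note that an array shorter than batch_size is trimmed to zero rows, so this only makes sense when every array is much larger than one batch.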