This repository contains the scripts used to train and deploy the classifiers reported in "Classifying ENU induced mutations from spontaneous germline mutations in mouse with machine learning techniques" by Zhu, Ong and Huttley.
Inside a conda environment, run pip on the downloaded zip file.
$ pip install mutori.zip
Installation creates the mutori and mutori_batch command line scripts. Command line help for mutori shows:
$ mutori
Usage: mutori [OPTIONS] COMMAND [ARGS]...
mutori -- for building and applying classifiers of mutation origin
Options:
--help Show this message and exit.
Commands:
lr_train logistic regression training, validation,...
nb_train Naive Bayes training, validation, dumps...
ocs_train one-class svm training for outlier detection
performance produce measures of classifier performance
predict predict labels for data
sample_data creates train/test sample data
xgboost_train Naive Bayes training, validation, dumps...
Command line help for mutori_batch shows:
$ mutori_batch
Usage: mutori_batch [OPTIONS] COMMAND [ARGS]...
mutori_batch -- batch execution of mutori subcommands
Options:
--help Show this message and exit.
Commands:
collate collates all classifier performance stats and...
lr_train batch logistic regression training
nb_train batch naive bayes training
ocs_train batch one class SVM training
performance batch classifier performance assessment
predict batch testing of classifiers
sample_data batch creation training/testing sample data
xgboost_train batch xgboost training
Input data files must be tab-delimited, with a header line. The file will be read by pandas.read_csv. Required columns are: varid, the variant identifier; flank5 and flank3, the 5' and 3' flanking sequences respectively; and direction, the mutation direction, with values of the form XtoY (where X and Y are nucleotides).
For training, the file must also contain a response column with a value of either e or g (for ENU and spontaneous germline respectively).
If the GC% is to be examined, a GC column is also required.
The column order does not matter.
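The column requirements above can be checked before running any mutori command. The following is a minimal sketch that uses only the Python standard library rather than pandas. The function name check_table, its arguments, and the assumption that nucleotides are written as uppercase A, C, G or T are illustrative and are not part of mutori.

```python
import csv
import io
import re

# Columns listed as required in this README.
REQUIRED = ("varid", "flank5", "flank3", "direction")
# Assumption: nucleotides are uppercase A, C, G or T, e.g. "AtoG".
DIRECTION = re.compile(r"^[ACGT]to[ACGT]$")


def check_table(handle, training=False, with_gc=False):
    """Return a list of problems found in a tab-delimited mutation table."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = set(reader.fieldnames or [])
    needed = list(REQUIRED)
    if training:
        needed.append("response")
    if with_gc:
        needed.append("GC")
    problems = [f"missing column: {name}" for name in needed if name not in header]
    if problems:
        return problems
    for num, row in enumerate(reader, start=2):  # line 1 is the header
        if not DIRECTION.match(row["direction"]):
            problems.append(f"line {num}: bad direction {row['direction']!r}")
        if training and row["response"] not in ("e", "g"):
            problems.append(f"line {num}: bad response {row['response']!r}")
    return problems


# A made-up two-row training table; column order is deliberately mixed.
example = (
    "varid\tresponse\tflank5\tflank3\tdirection\n"
    "var1\te\tACG\tTTA\tAtoG\n"
    "var2\tg\tGGC\tCAT\tCtoT\n"
)
print(check_table(io.StringIO(example), training=True))
```

An empty list means no problems were found. Because columns are looked up by name, the check does not depend on column order.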
Trained classifiers are saved in Python's pickle format. Also saved are the attributes defining the feature set on which the classifier was trained.
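As a rough illustration of what storing a classifier alongside its feature settings in pickle format can look like, the sketch below writes and reads back a stand-in object. The dictionary keys, the file name and the helper functions are hypothetical; the actual layout of the files mutori writes may differ.

```python
import os
import pickle
import tempfile

# Hypothetical payload: a stand-in "classifier" plus the feature settings
# it was trained with. Real mutori files may be organised differently.
payload = {
    "classifier": {"model": "placeholder"},
    "feature_params": {"flank_size": 2, "feature_dim": 1},
}


def save_classifier(path, obj):
    """Write obj to path in pickle format."""
    with open(path, "wb") as out:
        pickle.dump(obj, out)


def load_classifier(path):
    """Read a pickled object back from path."""
    # Only unpickle files you trust: loading can execute arbitrary code.
    with open(path, "rb") as infile:
        return pickle.load(infile)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "classifier.pkl")
    save_classifier(path, payload)
    restored = load_classifier(path)

print(restored["feature_params"])
```

Keeping the feature settings next to the classifier means new data can be encoded the same way the training data was before it is passed to the classifier.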
Stored in JSON format.
Collation is done via the mutori_batch collate command, which produces tab-separated files of key performance metrics and summary statistics for each of them.
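To give a sense of the kind of summary involved, the sketch below groups per-run metrics from a tab-separated table and reports the mean and standard deviation per classifier. This is not the collate implementation; the column names (classifier, auc), the values and the summarise function are made up for illustration.

```python
import csv
import io
import statistics

# Hypothetical per-run metrics; column names and values are illustrative only.
runs = (
    "classifier\tauc\n"
    "lr\t0.90\n"
    "lr\t0.94\n"
    "nb\t0.80\n"
    "nb\t0.84\n"
)


def summarise(handle, group="classifier", metric="auc"):
    """Return mean, sample standard deviation and count of metric per group."""
    grouped = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        grouped.setdefault(row[group], []).append(float(row[metric]))
    return {
        name: {
            "mean": statistics.mean(vals),
            "std": statistics.stdev(vals),  # needs at least two runs per group
            "n": len(vals),
        }
        for name, vals in grouped.items()
    }


summary = summarise(io.StringIO(runs))
for name, stats in summary.items():
    print(f"{name}\t{stats['mean']:.3f}\t{stats['std']:.3f}\t{stats['n']}")
```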
The BSD 3-clause license is included in this repo; refer to license.txt.